The present study was conducted to investigate the relationship between
student  achievement  outcomes  in  a district  Title  I program,  as  determined  by 
three  measurement  methods.  Correlation  matrices  were  used  to  investigate  the 
external  and  consequential  aspects  of  construct  validity  of  a portfolio-based 
assessment  in  relation  to  a norm-referenced  assessment  and  a 
performance-based  assessment,  with  subpopulations  defined  by  grade, 
ethnic/racial  group,  gender,  and  socioeconomic  status.  A second  concern  was 
to  determine  if  the  percentage  of  students  in  each  of  the  subgroups  meeting 
program  objectives  remained  consistent  across  measurement  method. 

Data from a sample of 1,742 students in grades 3 to 5 enrolled in one
Florida district’s Title I program were analyzed. Data were drawn from recent
administrations of the Stanford Achievement Test, a norm-referenced,
standardized  achievement  test;  the  Florida  Writing  Assessment,  a controlled 
process  writing  sample  for  fourth-grade  students;  and  the  portfolio-based 
Student  Outcomes  Instrument  (SOI),  a summary  instrument  of  student 
proficiency  in  the  areas  of  reading,  writing,  and  mathematics. 

Internal  consistency  coefficients  for  several  features  of  portfolio  artifacts 
and  absolute  and  relative  standard  errors  of  measurement  were  estimated.  A 
cross-tabulation  of  percentage  of  students  in  each  subgroup  meeting  the 
proficiency  criteria  was  conducted  for  each  of  the  measurement  methods. 
Cohen’s  Kappa  was  calculated  as  an  index  of  the  decision  consistency  across 
measurement  methods.  A correlation  matrix  was  generated  for  the  SAT,  the 
writing  assessment,  and  the  SOI.  Convergent  and  discriminant  validity 
coefficients  were  examined  using  Steiger’s  z*  and  exploratory  factor  analysis. 

No  strong  patterns  of  inconsistency  among  subgroups  were  evident  for 
the  generalizability  coefficient  or  standard  errors.  However,  the  percentages  of 
students  classified  as  proficient  by  each  measurement  method  varied 
substantially,  particularly  when  performance  on  the  Florida  Writing  Assessment 
was  considered.  The  discrepancies  in  percentages  classified  as  proficient  by 
different  measurement  methods  occurred  consistently  across  subgroups. 
External  evidence  of  construct  validity  was  noted  in  the  positive  relationship  of 
performance  on  the  SOI  to  performance  on  the  SAT  and  the  Writing 
Assessment.  However,  analyses  of  convergent  and  discriminant  validity 
coefficients  yielded  questionable  evidence  of  construct  validity. 

CHAPTER 1
INTRODUCTION 

Since  its  inception  in  1965,  Title  I of  the  Elementary  and  Secondary 
Education  Act  and  its  1994  reauthorization  as  Title  I of  the  Improving  America’s 
Schools Act (IASA) have provided support for supplemental instruction to
America’s  disadvantaged  youth.  [Note:  Throughout  this  study,  unless  a specific 
reference  is  made,  the  program  is  referred  to  as  “Title  I”  although  it  has  also 
been  known  as  “Chapter  I”.]  This  program  is  the  federal  government’s  largest 
single  kindergarten-through-grade-12  education  expenditure,  accounting  for 
approximately  one-fifth  of  the  United  States  Department  of  Education’s  total 
budget.  Since  1965,  Congress  has  appropriated  over  $80  billion  to  districts 
with  high  concentrations  of  low-income  families  for  Title  I,  including  $6.7  billion 
in  fiscal  year  1995  to  serve  5.5  million  children  (Capitol  Publications,  1995; 
United  States  Department  of  Education,  1993b).  Funds  are  targeted  at  schools 
with  the  greatest  concentrations  of  low-income  students,  and  within  these 
schools  students  are  selected  into  the  Title  I program  based  on  academic  need 
and  configuration  of  the  Title  I program  at  the  particular  site.  The  program  has 
been  credited  with  (a)  narrowing  the  achievement  gap  between  disadvantaged 
students  and  their  more  advantaged  peers,  (b)  increasing  parental  involvement 
in  education,  and  (c)  furthering  the  field  of  educational  evaluation  and 
assessment  (United  States  Department  of  Education,  1993c). 

According to Ligon (1993), “Title I set the pace [for educational program
evaluation]  because  of  the  resources  provided  and  the  mandates  imposed”  (p. 
1).  The  required  use  of  norm-referenced  standardized  achievement  tests  in 
Title  I evaluation  provided  a major  impetus  to  widespread  use  of  this  type  of 
testing in schools throughout the nation in the late 1960s (Mehrens & Lehmann,
1969).  The  Title  I Evaluation  and  Reporting  System  (TIERS)  developed  in  the 
mid-1970s required that all Title I students in grades 2 and above be
administered a nationally normed test so that pre- to posttest Normal Curve
Equivalent (NCE) gain scores in both basic and advanced skills could be
calculated (United States Department of Education, 1991, p. 125). These data
were  aggregated  by  grade  level,  model  of  delivery  (e.g.,  in-class,  pullout, 
schoolwide,  etc.),  school,  district,  state,  and  nation.  With  the  reauthorization 
legislation  of  1994,  however,  there  have  been  substantial  changes  to  the 
expectations  for  Title  I evaluation. 
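
The pre- to posttest NCE gain computed under TIERS can be illustrated briefly. An NCE is a normalized standard score with a mean of 50 and a standard deviation of 21.06, so a national percentile rank converts to an NCE through the inverse of the standard normal distribution. The sketch below assumes scores are available as national percentile ranks; the function names and the example percentiles are hypothetical and are not data from this study.

```python
from statistics import NormalDist

NCE_MEAN = 50.0   # mean of the NCE scale
NCE_SD = 21.06    # standard deviation of the NCE scale


def percentile_to_nce(percentile_rank: float) -> float:
    """Convert a national percentile rank (strictly between 0 and 100) to an NCE."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile rank must lie strictly between 0 and 100")
    z = NormalDist().inv_cdf(percentile_rank / 100)  # standard normal deviate
    return NCE_MEAN + NCE_SD * z


def nce_gain(pre_percentile: float, post_percentile: float) -> float:
    """Pretest-to-posttest gain expressed in NCE units."""
    return percentile_to_nce(post_percentile) - percentile_to_nce(pre_percentile)


# Hypothetical student: 30th percentile at pretest, 36th at posttest.
print(round(nce_gain(30, 36), 1))
# A student who holds the same percentile rank shows no NCE gain.
print(nce_gain(30, 30))
```

The constant 21.06 makes NCEs of 1, 50, and 99 coincide with percentile ranks of 1, 50, and 99, which is why the scale can be read alongside percentiles while still supporting averaging.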

In  keeping  with  the  reinvention  of  the  Title  I program,  there  was  a 
mandate  to  reexamine  the  program’s  effectiveness  with  methods  that  would 
include 

the  development  of  alternative  assessments  that  reflect  the  current 
directions  and  priorities  in  curriculum  and  instructional  reform  by 
building  on  and  supporting  the  work  of  selected  state  departments, 
local  school  districts,  and  other  research  and  development 
agencies. (Estes, 1993, p. 11)

In  addition,  with  the  1994  reauthorization,  the  Independent  Review  Panel  of  the 
National  Assessment  of  Title  I called  for  replacing  the  previous  evaluation 
system  with  one  that  would  serve  three  broad  functions:  (a)  to  serve  as  a 
national evaluation of Title I schools and students, (b) to serve as a measure of
school  progress  and  accountability,  and  (c)  to  provide  information  about 
individual  students  for  teachers  and  parents.  In  particular,  the  1994  law 
prescribed  that 

A State  shall  develop  or  adopt  challenging  content  and  student 
performance  standards  that  will  be  used  by  the  State,  its  LEAs,  and  its 
schools  to  carry  out  this  subpart.  Standards  under  this  subpart  must 
include-- 

(i) Challenging  content  standards  in  academic  subjects  that-- 

(A) Specify  what  children  are  expected  to  know  and  be  able  to  do; 

(B) Contain  coherent  and  rigorous  content;  and 

(C) Encourage  the  teaching  of  advanced  skills;  and 

(ii) Challenging  student  performance  standards  that-- 

(A) Are  aligned  with  the  State’s  content  standards; 

(B) Describe two levels of high performance--proficient and advanced--
that determine how well children are mastering the material in the
State’s content standards; and

(C) Describe a third level of performance--partially proficient--to provide
complete information to measure the progress of lower-performing
children toward achieving to the proficient and advanced levels of
performance. (Improving America’s Schools Act of 1994, Pub. L. No.
103-382, §Title I, §1111.2, 1994)

Furthermore, the law also called for the use of “multiple up-to-date measures of
student performance, including measures that assess higher order thinking
skills and understanding” (Improving America’s Schools Act of 1994, Pub. L.
No. 103-382, §Title I, §1111.3, 1994). Taken together, these evaluation
requirements constitute a radical departure from the norm-referenced,
standardized test approach that has been used to evaluate the effectiveness of
Title I programs at the local district level in the past. Namely, districts and states
must now implement procedures to assess student progress using multiple
forms of assessments and set performance standards in those assessments.

To date, there has been limited study of the relationship of alternative
assessments to multiple-choice, norm-referenced tests (MC-NRTs) and their role in the evaluation of Title I programs. In
one  of  the  first  published  investigations,  Buttram  and  McCann  (1993) 
investigated  the  relationship  of  NCE  gains  in  the  areas  of  reading  and 
mathematics  to  teachers’  grades  on  reading  and  mathematics,  reading  level 
(based  on  end-of-year  basal  level  in  which  the  student  was  working),  and  an 
exit  indicator  (whether  or  not  the  student  would  remain  eligible  for  Title  I 
services  during  the  following  school  year,  based  on  absolute  performance  on 
the  NRT).  They  concluded  that  “schools  frequently  demonstrated  significant 
increases  on  one  of  the  indicators  while  producing  significant  decreases  on  the 
other”  (p.  4)  and  that  “before  the  process  for  evaluating  Title  I programs  is 
altered,  much  more  discussion  and  solid  evidence  about  potential  alternatives 
will  be  needed”  (p.  5). 

Currently,  three  types  of  indicators  of  student  achievement  seem  to  be 
viable  for  future  use  with  Title  I evaluations:  (a)  the  traditional  objective, 
standardized  achievement  test,  (b)  “on-demand,”  standardized  performance 
assessments  developed  and  graded  by  external  assessors,  and  (c)  classroom 
assessment  systems,  typically  based  on  portfolios  or  samples  of  student  work 
elicited  within  the  natural  classroom  instructional  environment. 

Unfortunately,  the  reauthorization  legislation  for  Title  I occurred  before 
the  information  needed  to  evaluate  assessment  alternatives  could  be 
marshaled. This has left policy-makers, program specialists, and educators who
administer  Title  I programs  with  the  need  to  create  new  program  evaluation 
guidelines  without  an  adequate  base  of  experience  or  empirical  research  to 
guide policy and practice. Baker and Linn (1996) have identified the concerns
surrounding  the  use  of  multiple  measures  as  one  of  the  most  complex 
challenges  of  implementing  evaluations  of  the  new  Title  I programs. 

Description  of  This  Study 

Purpose 

The  purpose  of  this  study  was  to  investigate  the  relationship  between 
student  achievement  outcomes  in  a Title  I program  at  the  district  level,  when 
achievement  was  measured  by  (a)  a portfolio-based  classroom  assessment 
system,  (b)  the  more  traditional  MC-NRT,  and  (c)  a state-level  performance 
assessment  in  writing.  The  specific  focus  of  the  study  was  the  relationship  of 
the  portfolio  performance  to  the  other  measures.  Data  for  this  study  were 
obtained during the 1995-1996 school year in a middle-sized southeastern
rural/suburban  school  district.  These  data  reflected  the  performance  of  students 
in  grades  3,  4,  and  5 in  the  areas  of  reading,  writing/language  arts,  and 
mathematics.  In  contrast  to  Buttram  and  McCann’s  (1993)  study  in  which 
teacher  grading  practices  were  uncontrolled,  this  study  focused  on  the  use  of  a 
district-developed,  standardized  assessment  instrument  based  on  portfolio 
contents.  Teacher  ratings  on  this  portfolio  assessment  instrument  were  related 
to  student  performance  on  both  the  district’s  MC-NRT  and  the  state’s 
performance  assessment  in  writing. 

Research  Questions 

The specific research questions addressed were as follows for each
grade level and subject area:

1. What is the internal consistency of the teacher ratings of scored
features  of  student  artifacts  in  a standardized  portfolio  assessment? 

2.  Do  the  internal  consistency  coefficients  of  the  portfolio-based  ratings 
vary  significantly  across  grade  level,  ethnic  group,  gender  group,  and 
socioeconomic  group?  How  do  the  absolute  and  relative  standard  errors  of 
measurement  vary? 

3.  What  proportion  of  students  are  consistently  classified  as  proficient  or 
not  proficient,  based  on  each  possible  pair  of  the  three  assessment  methods? 

4.  Does  the  consistency  of  classification  on  different  pairs  of  assessment 
methods  vary  across  grade  level,  ethnic  group,  gender  group,  and 
socioeconomic  group? 

5.  What  is  the  pattern  of  convergent  and  discriminant  validity  coefficients 
between  the  MC-NRT,  a state-level  performance  assessment  (writing),  and  the 
portfolio-based  instrument  in  subject  areas  of  reading,  mathematics,  and 
language  arts? 

6.  Controlling  for  class  mean  level  of  performance,  what  is  the  pattern  of 
convergent  and  discriminant  validity  coefficients? 
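
Questions 3 and 4 turn on how often two assessment methods place the same student in the same proficiency category. A minimal sketch of that calculation follows, assuming each method's decision has been coded 1 (proficient) or 0 (not proficient). The function name and the ten illustrative classifications are hypothetical and are not data from this study. The sketch reports the observed proportion of consistent classifications together with Cohen's kappa, which adjusts that proportion for the agreement expected by chance.

```python
def decision_consistency(method_a, method_b):
    """Return (observed agreement, Cohen's kappa) for two 0/1 classifications."""
    if len(method_a) != len(method_b) or not method_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(method_a)
    # Proportion of students given the same decision by both methods.
    observed = sum(a == b for a, b in zip(method_a, method_b)) / n
    # Marginal proportions classified as proficient by each method.
    p_a = sum(method_a) / n
    p_b = sum(method_b) / n
    # Agreement expected by chance if the two decisions were independent.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Kappa is undefined when every student falls in one category on both methods.
        return observed, float("nan")
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical proficiency decisions for ten students (illustration only).
portfolio = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
norm_referenced = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]

agreement, kappa = decision_consistency(portfolio, norm_referenced)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")  # agreement = 0.80, kappa = 0.60
```

In this illustration the two methods agree on 8 of 10 students, but because half of the agreement would be expected by chance, kappa is a more conservative 0.60. Repeating the calculation within each grade, ethnic, gender, and socioeconomic subgroup would address the comparison posed in question 4.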

Significance  of  this  Study 

With  the  1994  reauthorization  of  the  Title  I program,  a great  deal  of 
attention  was  focused  on  the  development  of  new  assessment  formats  that 
were  “better  aligned  with  local  and  state  curricula,  give  students  tasks  with  real 
world  value,  and  yield  information  about . . . reasoning  processes”  (United 
States  Department  of  Education,  1993a,  p.  8).  Currently,  efforts  are  under  way 
at  the  state  level  to  establish  accountability  systems  that  fit  the  above 
description and that include “multiple up-to-date measures” as required by law
(Improving America’s Schools Act of 1994, Pub. L. No. 103-382,
§Title I, § 1111(a)(3)(E)). Furthermore, because individual districts may choose
to use additional measures or indicators for the purposes of evaluation
(Improving America’s Schools Act of 1994, Pub. L. No. 103-382,
§Title I, § 1116(a)(2)), many districts have focused their attention on the
development  of  new  alternative  forms  of  assessment  for  the  purposes  of 
program  evaluation. 

But how well do these new assessment formats measure what they
purport to measure? According to Shepard (1992a),

Such measures can be scored reliably enough to provide
accurate accountability evidence . . . the reason policy makers
should be willing to invest in performance assessment is, again,
not just because it will yield more valid data but because with the
right kinds of tasks, it will lead educational reform in the right
direction. (pp. 325-326)

Prior  to  the  present  study,  there  has  been  limited  effort  to  examine  the 
relationship  between  the  varying  methodological  options  for  evaluating  Title  I 
programs  called  for  in  the  reauthorization.  To  date  in  the  context  of  Title  I,  there 
has  been  no  investigation  of  the  relationship  between  student  achievement 
results  on  MC-NRTs  and  portfolio-based  assessment  with  specified  student 
proficiency  levels  in  multiple  academic  domains.  If  the  new  assessment  formats 
are  not  reliable  or  if  the  consequences  for  individual  students  differ  depending 
on  type  of  instrument  used,  then  this  may  raise  questions  in  terms  of  the  legal 
issues  of  equity  and  equal  access  to  services.  Furthermore,  if  decisions  about 
the  effectiveness  of  school  Title  I programs  vary  as  a function  of  the  assessment 
method, this could have important implications for Title I resource allocations,
instructional  strategies,  program  improvement,  student  selection  into 
supplemental  Title  I programs,  and  the  way  in  which  Title  I programs  are 
evaluated  at  school  and  district  levels.  It  is  within  this  framework  that  the 
research  questions  of  the  current  study  were  proposed. 

This  district  provided  a unique  opportunity  for  research  on  this  topic,  as  it 
has  completed  its  sixth  year  of  implementing  a portfolio  program  at  the 
elementary  level,  as  well  as  the  use  of  a district-developed  portfolio-based 
instrument,  the  Student  Outcomes  Instrument,  as  an  alternative  means  of 
measuring  achievement  of  students  receiving  Title  I program  services.  Thus, 
the  system  has  been  implemented  for  a sufficient  time  to  have  allowed  teacher 
familiarization,  inservice  training  with  the  method,  and  possible  curricular 
alignments  to  have  occurred.  Furthermore,  because  this  district  is  regarded  as 
a “leader”  in  the  state  in  development  of  classroom  portfolio  assessments,  its 
system  is  likely  to  be  emulated  or  replicated  in  other  districts  throughout  the 
state. 

Finally,  the  study  demonstrates  a specific  application  of  several  aspects 
of  construct  validation  for  performance  assessments  recently  set  forth  by 
Messick  (1995).  Messick  suggested  the  use  of  a comprehensive  approach  to 
validity  that  integrated  considerations  of  content,  criteria,  and  consequences 
into  a construct  framework  for  empirically  testing  hypotheses  about  score 
meaning  and  utility.  Specifically,  he  has  developed  a construct  framework  of  six 
aspects  of  general  validity  criteria:  content,  substantive,  structural, 
generalizability,  external,  and  consequential  aspects  of  validation.  The 
questions addressed in this study focus on the fourth, fifth, and sixth aspects of
Messick’s  approach  to  validation,  namely, 

4.  The  generalizability  aspect  examines  the  extent  to  which 
score  properties  and  interpretations  generalize  to  and  across 
population  groups,  settings,  and  tasks,  including  validity 
generalization  of  test-criterion  relationships. 

5.  The  external  aspect  includes  convergent  and  discriminant 
evidence  from  multitrait-multimethod  comparisons,  as  well  as 
evidence  of  criterion  relevance  and  applied  utility. 

6.  The  consequential  aspect  appraises  the  value  implication  of 
score  interpretation  as  a basis  for  action  as  well  as  the  actual 
and  potential  consequences  of  test  use,  especially  in  regard 
to  sources  of  invalidity  related  to  issues  of  bias,  fairness,  and 
distributive justice. (pp. 3-4)

This  study  can  provide  a useful  demonstration  of  application  of  some  aspects  of 
Messick’s  theoretical  conception  of  validation  in  a critical  area  of  educational 
evaluation. 


Overview  of  the  Dissertation 

This  study  was  designed  to  investigate  the  relationship  between  the  use 
of  a MC-NRT,  the  state’s  writing  assessment,  and  a portfolio-based,  classroom 
assessment  in  the  evaluation  of  the  Title  I program  in  a middle-sized 
suburban/rural  school  district  in  central  Florida.  The  study  involved  seven  Title  I 
elementary  schoolwide  programs.  Achievement  data  were  collected  on  the 
Stanford  Achievement  Test,  Eighth  Edition  (SAT8)  and  the  district-developed 
Title I Student Outcomes Instrument for grades 3 through 5 students in the
reading,  writing/language  arts,  and  mathematics  domains;  in  addition,  the 
Florida  Writing  Assessment  was  included  in  the  analysis  at  grade  4. 

Further  discussion  of  empirical  research  regarding  the  use  of 
performance-based  and  portfolio  assessments  for  accountability  purposes  as 
well as an historical analysis of the Title I program is presented in the review of
literature,  Chapter  2.  In  Chapter  3,  a description  of  the  methodology  for 
conducting  this  research  study  can  be  found.  The  results  of  the  data  analysis 
are  presented  in  Chapter  4.  A discussion  of  the  results  and  implications  for 
further  study  and  practice  are  presented  in  Chapter  5. 


CHAPTER  2 

REVIEW  OF  THE  LITERATURE 

Several  bodies  of  conceptual  and  empirical  literature  provide  the 
framework  for  the  investigation  of  the  utility  of  performance  and  portfolio 
measures  in  Title  I evaluation.  One  pertinent  strand  of  literature  addresses 
writings  pertaining  to  the  history  of  the  Title  I program  and  the  role  of 
achievement  tests  in  its  evaluation.  A second  set  of  relevant  literature  focuses 
on  the  use  of  alternative  assessments  in  the  broad  arena  of  program  evaluation. 
Finally,  empirical  research  studies  pertaining  to  the  validity  of  performance 
assessments  are  reviewed. 

Title  I:  History  and  Role  of  Evaluation 
On  April  11,  1965,  Title  I of  Public  Law  89-10,  the  Elementary  and 
Secondary  Education  Act  (ESEA),  which  authorized  $1  billion  for  compensatory 
education  programs,  was  signed  into  law  (Alford,  1965).  It  was  the  first  broad 
federal  aid  bill  for  elementary  and  secondary  education  in  the  nation’s  history 
and  central  to  President  Lyndon  B.  Johnson’s  Great  Society  policies  which 
were  designed  to  eliminate  poverty  and  equalize  economic  opportunity  (Kantor, 
1991).  At  the  time  of  its  inception  and  the  year  that  followed,  it  was  largely 
viewed  as  having  a profound  impact  on  every  aspect  of  education  (Wayson, 
1966). 

According to Stickney and Marcus (1985) and Miller (1991), ESEA was
originally  proposed  by  President  Kennedy  in  1961  as  the  School  Assistance 
Act,  which  met  with  great  opposition  from  the  Catholic  church  (because  funds 
were  aimed  at  public  schools  only)  and  southern  states  (because  funds  would 
be  withheld  from  segregated  school  districts).  Again  in  1962,  Kennedy 
attempted  unsuccessfully  to  have  the  bill  passed.  In  1963,  he  proposed  an 
omnibus  education  bill.  The  higher  education  and  vocational  education 
portions  were  enacted  into  law  shortly  after  his  assassination,  but  the 
elementary  and  secondary  portions  did  not  come  up  for  a vote  at  all  during 
1963. 

Stickney  and  Marcus  (1985)  cited  two  factors  which  were  needed  to 
bring  about  passage  of  ESEA.  The  first  was  President  Johnson’s  landslide 
victory  in  the  1964  elections,  “which  swelled  the  congressional  ranks  of 
northern  and  western  liberals  and  provided  a mandate  for  the  reform.”  The 
second  factor  was  “Johnson’s  skill  in  the  art  of  compromise”  (p.  561).  The  issue 
of segregation was resolved with the enactment of the Civil Rights Act of 1964,
which  barred  any  appropriation  of  federal  funds  to  segregated  schools.  The 
issue  of  private  school  student  participation  was  resolved  with  the  targeting  of 
funds  at  individual  children  rather  than  schools,  which  allowed  private  school 
students  to  participate  in  Title  I programs. 

The  original  Title  I focused  on  the  improvement  of  deficits  in  the  basic 
academic  skills  through  the  provision  of  teaching  specialists,  aides,  special 
training,  new  equipment  and  materials,  special  preschool  and  summer 
programs, student centers, and tutoring services (First Annual Report, Title I,
Elementary and Secondary Education Act of 1965, in Plunkett, 1985, p. 535).
Subsequent  amendments  to  the  original  law  in  1972,  1978,  1981,  1988,  and 
1994 contained clearer language of the intent of Congress, including elements
pertaining  to  (a)  program  objectives,  (b)  coordination  with  the  regular  classroom 
program  and  with  other  special  programs,  (c)  parental  involvement,  (d) 
dissemination  to  Title  I staff  of  research  data  and  information  on  promising 
practices,  and  (e)  dissemination  to  parents  and  the  community  of  information 
about  the  Title  I program  (Plunkett,  1985,  p.  535). 

A common  strand  found  throughout  each  of  the  reauthorizations  is  the 
specification  that  programs  funded  under  Title  I must  be  “of  sufficient  size,  scope 
and  quality  to  give  reasonable  promise  of  substantial  progress  toward  special 
educational needs of the children being served” (Plunkett, 1991, p. 339). In
terms  of  program  evaluation,  the  amendments  of  1988  led  to  an  increased 
emphasis  on  evaluation  results  in  the  form  of  program  improvement 
requirements  (Jennings,  1991). 

Since  its  inception,  the  impetus  for  a reporting  mechanism  to  assess  the 
effectiveness  of  the  Title  I program  has  gained  momentum  and  promoted  the 
evaluation  and  accountability  movements  in  education  in  general  (Berk,  1981; 
David,  1981;  Halperin,  1975).  Originally  suggested  by  Senator  Robert 
Kennedy,  the  requirement  for  “evaluating  at  least  annually  the  effectiveness  of 
the  programs”  (Sec.  205[a][5],  P.L.  89-10  as  cited  in  Davis,  1991,  p.  380)  was 
contained  both  in  the  original  law  and  expanded  upon  in  each  subsequent 
reauthorization. Because of its accountability requirements, Title I has been
viewed as driving the amount of testing that has been conducted across the
nation. Noll (1965, p. 385) outlined a recommended minimum annual testing
program for public schools at that time as follows:

Grade Level    Type of Test

K or 1         Readiness
2              Intelligence
3 or 4         Achievement Battery (including reading)
3 or 4         Intelligence
6 or 8         Intelligence
6 or 7         Achievement Battery (including reading)
9              Interests
10             Intelligence
12             Interests

As  can  be  seen  above,  prior  to  the  passage  of  Title  I of  ESEA  in  1965,  the 
suggested  annual  testing  program  emphasized  intelligence  testing  at  grades  2, 
3 or  4,  6 or  8,  and  10.  Achievement  batteries  were  suggested  for  grades  3 or  4 
and  grades  6 or  7.  With  the  implementation  of  Title  I,  the  use  of  standardized 
achievement  tests  in  public  schools  increased  dramatically.  Section  205  of 
Public  Law  89-10  called  for 

effective  procedures  including  provisions  for  appropriate  objective 
measures  of  educational  achievement  will  be  adapted  for 
evaluating  at  least  annually  the  effectiveness  of  the  program  in 
meeting  the  special  educational  needs  of  educationally  deprived 
children.  (Halperin,  1975,  p.  8) 

According  to  Mehrens  and  Lehmann  (1969),  these  “appropriate  . . . 
measures  of  educational  achievement”  were,  for  the  most  part,  standardized 
tests  (p.  5).  Mehrens  and  Lehmann  reported  the  number  of  test  booklets  and 
answer  sheets  sold  in  the  United  States  from  1955  through  1967,  showing  an 
increase  in  sales  from  1964  to  1965  and  1965  to  1966  following  the  passage  of 
Title  I.  These  figures  are  reported  in  Table  1. 

Table 1

Net Sales of Standardized Tests and Answer Sheets

Year     K-12            Total

1955                     83,800,000
1956                     91,070,000
1957                     97,810,000
1958                     109,710,000
1959                     133,620,000
1960     122,650,000     140,750,000
1961     123,820,000     141,290,000
1962     125,520,000     146,630,000
1963     122,680,000     146,710,000
1964     122,300,000     149,100,000
1965     132,020,000     163,930,000
1966     169,990,000     205,070,000
1967     153,830,000     188,710,000

Mehrens and Lehmann (1969) pointed out that the decrease from 1966
to 1967 may have been due in part either to

(a) reuse of test booklets purchased earlier, (b) over sale of
booklets and answer sheets in 1964 and 1965, (c) a decrease in
external funds to purchase test supplies, and (d) the possibility that
schools may be illegally reproducing answer sheets with their own
equipment. (p. 5)
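
The year-to-year changes behind this discussion can be checked directly from the 1964 through 1967 figures in Table 1. The sketch below is only an arithmetic illustration; the variable and function names are hypothetical.

```python
# Net sales of tests and answer sheets for 1964-1967, copied from Table 1.
SALES = {
    1964: {"K-12": 122_300_000, "Total": 149_100_000},
    1965: {"K-12": 132_020_000, "Total": 163_930_000},
    1966: {"K-12": 169_990_000, "Total": 205_070_000},
    1967: {"K-12": 153_830_000, "Total": 188_710_000},
}


def percent_change(earlier: float, later: float) -> float:
    """Percentage change from the earlier figure to the later one."""
    return 100.0 * (later - earlier) / earlier


for column in ("K-12", "Total"):
    for year in (1965, 1966, 1967):
        change = percent_change(SALES[year - 1][column], SALES[year][column])
        print(f"{column:5s} {year - 1}-{year}: {change:+.1f}%")
```

With these figures, total sales rose by roughly 10% from 1964 to 1965 and by roughly 25% from 1965 to 1966, then fell by about 8% from 1966 to 1967.
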

In the original law, a specific evaluation approach was not prescribed. In
1969,  the  publication  of  a report  entitled  Title  I of  ESEA:  Is  it  helping  poor 
children?  (McClure  & Martin,  as  cited  in  Miller,  1991)  provided  the  impetus  for 
changes  in  the  Title  I program  by  revealing  numerous  perceived  shortcomings 
of  the  program,  including 

districts  spending  money  on  frivolous  purchases  or  ineligible 
schools,  poorly  planned  and  executed  programs,  inadequate  state 
oversight,  reluctant  and  timid  federal  management,  exclusion  of 
poor  people  from  program  planning,  and  a lack  of  information 
disseminated to parents regarding Title I. (p. C3)

The publication of this report led to a tightening of federal and state program
supervision in the 1970s, with stricter guidelines on monitoring and testing
requirements.

The  Title  I Evaluation  and  Reporting  System 

When  Title  I was  reauthorized  in  1974,  requirements  for  evaluation  were 
more  concise,  calling  for  objective  criteria  and  a methodology  for  producing 
data  which  would  be  comparable  on  both  a statewide  and  nationwide  basis 
(Sec. 151[f], P.L. 93-380, as cited in Davis, 1991, p. 380).

According  to  Cross  (1979),  a little-known  Republican  member  of  the 
House  of  Representatives,  Representative  Victor  Veysey  of  California,  authored 
section  151  which  required  a description  of  program  goals  and  methods  for 
evaluating  those  goals.  Here,  the  requirements  of  Section  151  are  presented  in 
a framework  suggested  by  Barnes  and  Ginsburg  (1979):  (a)  publish  standards 
for  the  evaluation  of  project  effectiveness,  (b)  provide  models  for  evaluations 
which  include  uniform  procedures  and  criteria  to  be  utilized  by  the  State 
Education Agency (SEA) and the Local Education Agency (LEA), (c) provide
technical  assistance  to  enable  SEAs  to  assist  LEAs  in  carrying  out  the 
evaluation  of  programs  in  accordance  with  the  models,  (d)  specify  objective 
criteria  and  techniques  for  producing  comparable  state  and  national  data,  and 
(e)  develop  a system  for  disseminating  evaluation  results,  including  identified 
exemplary  programs. 

In  June  of  1974,  RMC  Research  Corporation  was  awarded  a contract  to 
begin  the  development  of  the  summative  evaluation  models  that  were  to 
accompany  the  new  reporting  system.  By  1976,  the  Title  I Evaluation  and 
Reporting  System  (TIERS)  had  been  developed.  According  to  Tallmadge  and 
Wood  (1976),  the  objective  of  TIERS  was  “to  provide  meaningful,  comparable 
information  about  Title  I projects  at  the  school  building,  school  district,  state,  and 
federal level” (p. 1). Information pertaining to program participation, parent
advisory  councils,  personnel,  training,  cost,  and  program  impact  was  collected. 
Training  was  provided  to  states  and  local  districts,  with  regional  Technical 
Assistance  Centers  established  to  assist  states  in  adopting  the  evaluation 
models. 

The TIERS system was based upon the Normal Curve Equivalent (NCE),
which is a normalized standard score with a mean of 50 and standard deviation
of 21.06 (Linn, 1981, p. 93). The underlying principle of this design was the
assumption that, without program interventions, the treatment group would
maintain its status relative to a national norm group from pretest to posttest.
That is, no gain in NCE score would be observed (Tallmadge & Wood, 1976, p.
4). The system consisted of three basic models, all with the primary focus on
the question of “how much more did pupils learn by participating in the Title I
project than they would have learned without it” (p. 2). According to Echternacht
(1980), the models may be described as follows:

Model  A:  The  norm-referenced  model,  was  the  most  commonly 
used.  It  was  a pretest-posttest  design  with  growth  judged  in  terms 
of  gain  from  pretest  to  posttest.  National  norms  acted  as  a 
surrogate  comparison  group. 

Model B: The control group model, used either the idealized
randomized design or the nonequivalent control group design.
Either the Analysis of Covariance or standardized change score
analysis  was  used  to  estimate  treatment  effects.  This  model  was 
rarely  used  as  it  required  withholding  Title  I services  from  students 
who  would  normally  be  expected  to  receive  such  services,  so  that 
a control  group  would  remain  intact. 

Model  C:  The  special  regression  model,  although  most 
statistically  sound,  was  difficult  for  evaluators  to  use  due  to  a lack 
of  access  to  computers  at  the  time.  The  treatment  effect  was 
estimated as follows:

$$Y_c - Y_t - b(X_c - X_t)$$

where c and t refer to the comparison and treatment groups
respectively, and b is a regression coefficient estimated from only
the comparison group. This model required that the pretest be
used for program selection, with a stringent cutoff. (pp. 5-9)

Within each of these models, there were two versions: one version based the
comparisons on normed tests; the other version used nonnormed tests.
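
As an illustration of the NCE scale and the Model A logic described above, the following Python sketch converts national percentile ranks to NCEs and computes a mean NCE gain. The function names and the sample percentile ranks are hypothetical and are not taken from TIERS documentation. The conversion assumes the usual definition of the NCE as 50 + 21.06z, where z is the normal deviate corresponding to the percentile rank.

```python
from statistics import NormalDist, mean

NCE_MEAN = 50.0
NCE_SD = 21.06  # standard deviation cited in the text (Linn, 1981, p. 93)


def percentile_to_nce(percentile_rank: float) -> float:
    """Convert a national percentile rank (strictly between 0 and 100) to an NCE."""
    if not 0.0 < percentile_rank < 100.0:
        raise ValueError("percentile rank must lie strictly between 0 and 100")
    # Normal deviate (z score) corresponding to the percentile rank.
    z = NormalDist().inv_cdf(percentile_rank / 100.0)
    return NCE_MEAN + NCE_SD * z


def model_a_gain(pretest_percentiles, posttest_percentiles) -> float:
    """Mean NCE gain from pretest to posttest.

    Under the Model A assumption, the expected gain without treatment is 0,
    so a positive value is read as the estimated program effect.
    """
    pre = mean(percentile_to_nce(p) for p in pretest_percentiles)
    post = mean(percentile_to_nce(p) for p in posttest_percentiles)
    return post - pre


if __name__ == "__main__":
    # Hypothetical percentile ranks for five students (illustration only).
    pretest = [20, 25, 30, 35, 40]
    posttest = [25, 30, 35, 40, 45]
    print(round(model_a_gain(pretest, posttest), 2))
```

With a standard deviation of 21.06, NCEs of 1, 50, and 99 coincide approximately with percentile ranks of 1, 50, and 99, while the scale between those points is equal-interval and can therefore be averaged.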

Criticisms  of  TIERS 

Much  criticism  surrounded  the  TIERS  system,  based  in  part  on  concerns 
about  its  technical  adequacy,  the  tremendous  burden  placed  upon  LEAs  for 
evaluation  and  the  relative  lack  of  utility,  and  the  comparability  of  evaluation 
results  across  the  models  (Echternacht,  1980).  It  was  the  intent  of  the  model 
designers that additional locally relevant data also be collected. However, the
burden to comply with the requirements of the TIERS system made this nearly
impossible. There is a substantial body of literature documenting criticisms and
limitations of the TIERS system (e.g., Davis, 1991; Estes, 1993; Jaeger, 1979;
Linn, 1979; Stonehill & Groves, 1983; vanderPloeg, 1982; Wiley, 1979; Wisler &
Anderson, 1979).

Beginning  in  the  late  1970s,  there  was  a persistent  view  among 
measurement  experts  and  other  educators  that  the  TIERS  model  was  not 
adequately  meeting  their  needs  in  evaluating  the  Title  I program.  Jaeger  (1979) 
discussed  the  issue  of  aggregating  NCE  scores  across  Title  I projects  and  the 
large  measurement  error  stemming  from  the  use  of  different  tests.  He  proposed 
an alternative approach to evaluation, which was based on well-defined global
categories  of  project  impact  with  descriptors  like  highly  effective,  effective, 
ineffective,  and  aversive.  This  framework  was  similar  to  the  evaluation  system 
that  would  later  be  contained  in  the  1994  reauthorization  of  the  law. 

Ligon  (1991)  contended  that  regression  toward  the  mean,  floor  and 
ceiling  effects  in  the  tests,  attrition,  test-curriculum  mismatches,  low  reliability  of 
measures  at  prekindergarten  and  other  early  grades,  norming  periods  and 
testing  dates  that  mismatched  with  the  beginning  and  ending  of  program 
interventions,  norming  years  that  were  too  infrequent  to  reflect  current  levels  of 
national  achievement,  multiple  programs  impacting  the  same  students,  variable 
levels  of  service  for  individual  students,  outliers  with  very  large  gains  or  losses, 
cheating,  and  the  cumulative  effect  of  annual  gains  that  raise  targets  were  some 
of the problems associated with the system (p. 389).

Voicing similar concerns, Clayton (1991) cited several assumptions upon
which  the  norm-referenced  model  was  based,  which  may  or  may  not  have  been 
met  in  practice,  including  that 

(a)  the  population  being  tested  is  stable  enough  to  represent  a fair 
test  of  a school’s  Title  I program,  (b)  academic  growth  of  a student 
or  school  is  linear,  (c)  one  year  is  time  enough  for  making 
significant  change,  and  (d)  changing  demographic  conditions  of  a 
school need not be taken into account. (p. 349)

Clayton also described the effect that regression to the mean has on Title I
evaluation, noting that schools with high pretest scores are more likely than those
with low pretest scores to be identified for program improvement (p. 350).
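
The regression effect that Clayton described can be illustrated with a small simulation. The Python sketch below is hypothetical: the number of schools, the score distributions, and the function name are assumptions chosen for illustration, and no true change occurs between pretest and posttest. Any gain or loss it reports therefore arises from measurement error alone.

```python
import random
from statistics import mean


def simulate_gains(n_schools=2000, true_sd=10.0, error_sd=8.0, seed=0):
    """Simulate pretest and posttest school means with no true change.

    Returns the mean observed gain for the lowest and the highest
    pretest quartiles, in that order.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n_schools):
        true_score = rng.gauss(50.0, true_sd)            # stable true achievement
        pretest = true_score + rng.gauss(0.0, error_sd)  # independent error at pretest
        posttest = true_score + rng.gauss(0.0, error_sd) # independent error at posttest
        records.append((pretest, posttest - pretest))
    records.sort(key=lambda record: record[0])           # order schools by pretest score
    quartile = n_schools // 4
    low_gain = mean(gain for _, gain in records[:quartile])
    high_gain = mean(gain for _, gain in records[-quartile:])
    return low_gain, high_gain


if __name__ == "__main__":
    low, high = simulate_gains()
    print(f"lowest pretest quartile mean gain:  {low:.2f}")
    print(f"highest pretest quartile mean gain: {high:.2f}")
```

Under these assumptions, schools in the lowest pretest quartile show an apparent gain and schools in the highest quartile show an apparent loss, even though no program effect was simulated. A gain-based criterion would thus tend to flag high-pretest schools for program improvement.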

To determine the effect of regression to the mean on evaluation results,
Slavin and Madden (1991) studied a school in which the early intervention
program Success for All had been implemented. They concluded that the
school was undermining its Title I program by showing great gains during the
year in which students participated in Success for All followed by somewhat
limited gains in later grades. Finally, Shepard (1992b) contended that in
addition to the problem of regression toward the mean, the emphasis on
standardized test scores had narrowed the curriculum to include only the areas
which would be assessed on the test. This criticism was to become one of the
most compelling forces in shaping the evaluation sections of the Title I
reauthorization in 1994.

The  Impact  of  Title  I on  Student  Achievement 

Since the implementation of Title I programs, numerous studies have
been designed to measure the impact of Title I on student achievement. Forbes
(1985) found that during the 1970s the achievement of historically lower
achieving students improved. However, this improvement might be attributed to
many factors, including Title I, desegregation, and the movement to improve
public education. Slavin, as quoted in Miller (1991), concluded,

The  research  that  exists  to  date  basically  says  that  Chapter  I does 
work  in  the  sense  that  it  helps  kids  do  better  than  they  otherwise 
would  have  done,  but  the  research  also  shows  that  the  program 
doesn’t  work  well  enough  to  help  them  catch  up  with  their  more 
advantaged peers. (p. C8)

Research  evidence  has  supported  the  contention  that  Title  I has  had  a positive 
impact  on  student  achievement  in  limited  areas.  The  larger  issues,  however, 
are  whether  the  program  is  working  well  enough  to  meet  the  growing  needs  of 
the  nation’s  disadvantaged  youth  and  how  to  best  assess  its  effectiveness 
(Slavin,  1987). 

At a time when the national education system was undergoing increasing
scrutiny and criticism, intensified by the publication of A Nation at Risk in 1983,
the 1988 Hawkins-Stafford Elementary and Secondary School Improvement
Amendments were intended to emphasize student achievement in higher order
analytical, reasoning, and problem-solving skills. This shift in emphasis was
repeated during the Education Summit of 1989, during which the six National
Education Goals were formulated:

Goal 1: All children in America will start school ready to learn.

Goal  2:  The  high  school  graduation  rate  will  increase  to  at  least 
90  percent. 

Goal  3:  American  students  will  leave  grades  four,  eight,  and 
twelve  having  demonstrated  competency  in  challenging  subject 
matter including English, mathematics, science, history and
geography; and every school will ensure that all students learn to
use  their  minds  well,  so  they  may  be  prepared  for  responsible 
citizenship,  further  learning,  and  productive  employment  in  our 
modern  economy. 

Goal  4:  U.S.  students  will  be  first  in  the  world  in  science  and 
mathematics. 

Goal  5:  Every  adult  American  will  be  literate  and  will  possess  the 
knowledge  and  skills  necessary  to  compete  in  a global  economy 
and  exercise  the  rights  and  responsibilities  of  citizenship. 

Goal  6:  Every  school  in  America  will  be  free  of  drugs  and  violence 
and  will  offer  a disciplined  environment  conducive  to  learning. 

The adoption of the National Education Goals accelerated efforts to
establish high quality education programs and standards. During the late
1980s and early 1990s, this new effort eclipsed the goals of the Title I program.
Evidence of this has been gathered in several nationwide studies on the
effectiveness of the Title I program. One of the most comprehensive studies has
been the Prospects Study, a 6-year, longitudinal study that documents the
educational progress of disadvantaged students, with base year data collected
in 1991 (Abt Associates, 1993). In Figure 1, fourth-grade reading results
collected nationally during the 1992-93 school year are presented.

Figure 1 documents the achievement gap between students who
attended high poverty and low poverty schools. These findings are similar to
the results of the Chapter I Sustaining Effects Study, conducted in the late
1970s, which found that while Title I students gained more than comparably
disadvantaged students on most measures, the gains of the Title I participants
did not move them substantially toward the achievement levels of more
advantaged students (United States Department of Education, 1993a, p. 104).

Figure 1. Fourth graders’ reading performance by level of school poverty.
Source: Prospects (Abt Associates, 1993).


The  1994  Reauthorization  of  Title  I 

Results of national studies on the effectiveness of Title I such as
Prospects have caused educators, policy makers, and the general public to
reexamine the effectiveness of the Title I program and to formulate a strategy to
“reinvent” the program, stressing that

(a) Strategies that promote high standards must be implemented,
(b) the same high standards expected of all children [regardless of
poverty level] must be set, (c) funding must be concentrated on
high poverty schools, and (d) flexible use of resources must be
conditioned on accountability for progress toward standards.
(United States Department of Education, 1993c, pp. 13-14)

In its 1994 reauthorization, Title I of the Elementary and Secondary
Education  Act  (ESEA)  was  renamed  Title  I of  the  Improving  America’s  Schools 
Act (IASA), which has been described as a “renewed commitment” to aiding
disadvantaged  youth  (Johnston,  1994,  p.  C20).  The  emphasis  was  on  “broad 
State  and  local  flexibility”  in  Title  I program  implementation  (United  States 
Department  of  Education,  1994,  p.  54372).  There  was  an  emphasis  on 
challenging performance standards and a shift away from the overreliance on
standardized  norm-referenced  tests.  This  shift  toward  increased  flexibility  and 
the  use  of  multiple  measures  was  viewed  as  one  of  the  most  important  and 
positive  outcomes  of  the  reauthorized  act  (Qualls,  1994,  p.  11). 

In  testimony  presented  before  the  United  States  Senate  Committee  on 
Labor  and  Human  Resources  and  Subcommittee  on  Education,  Arts,  and  the 
Humanities,  Feuer  (1994)  presented  concerns  related  to  the  use  of 
norm-referenced  tests  to  measure  higher  order  analytical,  reasoning,  and 
problem-solving  skills.  As  another  option,  he  suggested  that  alternative  forms  of 
assessment  could  be  “powerful  catalysts  of  improved  classroom  instruction  and 
learning”  (p.  4).  LeTendre  (1991),  Director  of  Compensatory  Education  for  the 
United  States  Department  of  Education,  envisioned  a Title  I program  evaluation 
system  which  “promotes  flexibility  while  maintaining  an  acceptable  level  of 
accountability.  . . and  the  use  of  authentic  assessment  and  multiple  measures” 
(pp.  332-333).  LeTendre’s  view  was  in  keeping  with  the  Advisory  Committee  on 
Testing  in  Chapter  I (United  States  Department  of  Education,  1993a).  The 
committee  recommended  the  use  of  multiple  methods  of  assessment  and  “a 
shift away from procedural compliance and toward a concentration on
instruction and student learning outcomes” (p. ii, emphasis added). This
recommendation was largely based on an in-depth analysis of the previously
existing testing procedures from which the committee concluded

The overreliance on a single testing method--aggregated gain
scores on standardized, norm-referenced tests--does not provide
adequate information by which to judge the progress of students,
the quality of the school-level program or the effectiveness of the
national program. (p. vi)

With the 1994 reauthorization, Title I evaluation requirements were to
allow  individual  states  flexibility  in  designing  their  system  of  accountability  to  be 
based  on  the  state  accountability  plan.  Title  I required  that  the  progress  of 
students  in  meeting  challenging  standards  be  measured  at  the  grade  levels 
contained  in  each  state  plan,  at  a minimum  of  one  time  between  grades  3 
through  5,  6 through  8,  and  10  through  12.  The  focal  point  of  the  evaluation 
became  the  attainment  of  proficiency,  which  was  defined  by  individual  states,  as 
outlined  in  their  statewide  assessment  plans  (United  States  Department  of 
Education,  1994). 

In  addition  to  the  assessments  in  the  state  accountability  plan,  local 
education  agencies  were  encouraged  to  assess  the  progress  of  students  in 
meeting  proficiency  standards  through  the  use  of  high  quality  assessments  and 
multiple  measures.  Thus,  those  who  evaluate  Title  I programs  at  the  local  level 
will  need  to  engage  in  design  and  implementation  of  alternative  assessment 
methods,  including  performance  assessment,  to  demonstrate  program 
effectiveness in the future.

Classroom Use of Portfolio Assessment

Many educators, dissatisfied with mandated standardized assessment
modes, have begun collecting data on student achievement using other forms
of assessment, namely performance assessments (Hebert, 1992). A specific
type of performance assessment that has great appeal to classroom educators
is the portfolio. Arter and Spandel (1992) defined a student portfolio as “a
purposeful collection of student work that tells the story of the student’s efforts,
progress, or achievement in (a) given area(s)” (p. 36). At the classroom level,
performance assessments are most likely to be created as portfolio
assessments. Portfolios, according to Arter and Spandel,

should  be  continuous,  capture  a rich  array  of  what  students  know 
and  can  do,  involve  realistic  contexts,  communicate  to  students 
and  others  what  is  valued,  portray  the  processes  by  which  work  is 
accomplished, and be integrated with instruction. (p. 36)

Portfolios and performance assessments are viewed not merely as means of
assessing students. As Wiggins (1991) contended, the use of quality standards
in judging authentic student work at the local level can also serve as a vehicle
for school improvement and reform. Because tasks may be highly
individualized, he recognized that standards used in judging portfolio contents
and performance tasks would “necessarily be varied [to meet the specific needs
of a given context]” (p. 19). In implementing a portfolio system, issues of
representativeness, criteria to be used, differences in interpretations, content,
and questions related to the design of the system must be considered. At the
classroom level, pragmatic issues such as who selects the work to be entered in
the portfolio, how the portfolio relates to instruction, and how to find the time to
talk with students about their portfolio must also be considered (Arter &
Spandel, 1992).

Although  some  states  and  a number  of  school  districts  have  begun 
replacing  testing  programs,  either  in  whole  or  in  part,  with  portfolios  (Pelavin, 
1991),  the  portfolio  movement,  for  the  most  part,  has  been  rooted  at  the 
classroom level (Calfee & Perfumo, 1993). Calfee and Perfumo surveyed 150
selected  contact  persons  on  their  portfolio  practices  nationwide,  including  state, 
district,  school,  and  teacher  representatives.  Data  were  collected  via  a survey 
as  well  as  during  a 2-day  conference  in  which  follow-up  interviews  were 
conducted  with  24  of  the  original  respondents.  In  their  qualitative  analysis  of 
these  two  sets  of  data,  three  themes  pertaining  to  portfolio  use  surfaced:  (a) 
Teachers  who  used  portfolios  conveyed  an  intense  commitment  and  personal 
renewal;  (b)  the  technical  foundations  for  portfolio  assessment  appeared 
inconsistent  at  all  levels;  and  (c)  portfolio  practice  at  the  school  and  teacher 
level  shied  away  from  standards  and  grades  (p.  534).  In  particular,  Calfee  and 
Perfumo  found  that  while  respondents  claimed  that  portfolios  were  a valid 
assessment  of  student  progress  and  growth,  no  supporting  documentation  or 
information  was  reported  as  to  how  validity  and  reliability  of  the  portfolios  had 
been  established  (p.  535).  Furthermore,  they  found  that  the  areas  of  scoring 
and standard setting were generally not developed (p. 535). In subsequent
sections, issues surrounding the use of portfolios and their validation are
reviewed.

Arguments for Classroom Portfolio Systems
In  their  search  for  an  assessment  system  that  was  congruent  with  an 
integrated  approach  to  language  instruction,  a group  of  teachers  at  an  Illinois 
elementary  school  developed  their  own  portfolio  assessment  system  (Hebert, 
1992).  They  found  that  the  use  of  portfolios  along  with  student  reflection  on 
portfolio  contents  had  positively  impacted  their  ability  to  assess  multiple 
dimensions  of  learning  (p.  59).  Five  (1993)  found  that  using  language  arts 
portfolios in her grade 5 classroom helped students to set goals and to
“discover and define themselves as readers and writers” (p. 48). In an inner-city
school  in  New  Hampshire,  Hansen  (1992)  found  that  having  students  select 
items  for  their  portfolios  helped  them  to  focus  better  and  to  plan  relevant 
curriculum  for  themselves  through  goal  setting  (p.  68).  Buschman  (1993)  used 
portfolios  in  his  second-grade  classroom  in  Oregon  and  contended  that  “student 
portfolios  have  brought  rare  benefits”  (p.  22).  He  cited  the  “more  complete 
picture”  of  his  students’  academic  abilities  and  the  opportunity  for  self-reflection 
for  his  students  as  two  of  the  benefits  of  using  a portfolio  system  in  the 
classroom.  These  examples  illustrate  the  contention  that  portfolio  assessment 
may offer a broader view of students’ achievement and may improve instruction
and learning.

Alternative  Assessments  in  Program  Evaluation 
In  the  evaluation  of  educational  programs,  recent  years  have  witnessed 
growing  discontent  with  the  sufficiency  of  norm-referenced  multiple  choice  tests 
to  measure  student  achievement.  Numerous  authors  have  discussed  the 
inadequacy of such tests, particularly when they are used to meet a variety of
reporting requirements (e.g., Linn, 1993a; Madaus, 1988; Nitko, 1995; Smith &
O’Day,  1991).  For  example,  Jenkins  (1993)  described  some  of  the  problems 
with  using  such  tests  in  high-stakes  accountability  situations.  He  noted  that 
younger  children  often  have  trouble  with  multiple  choice  norm-referenced  test 
(MC-NRT)  answer  sheets,  may  experience  high  test  anxiety,  and  often  have 
little  motivation  to  perform  well  on  the  test.  He  stressed  that  often  a mismatch  of 
the  test  to  the  curriculum  makes  results  invalid  and  causes  teachers  to  teach  to 
the  test  in  order  to  overcome  this  mismatch  (pp.  3-5).  Jenkins  suggested  that 
one  solution  to  these  problems  would  be  “the  development  of  valid  and  reliable 
alternative  measures  to  standardized  tests”  (p.  7).  Similar  arguments  for 
alternative  assessments  have  been  expressed  by  Herman  and  Golan  (1993), 
Shepard  (1992a),  Smith  (1991),  and  Snyder,  Chittenden,  and  Ellington  (1993). 
In  particular,  Snyder  et  al.  specifically  cited  the  Chapter  I program  as  an 
example  of  “the  possible  policy  pitfalls  of  failing  to  use  classroom  evidence  for 
decision  making”  (p.  9).  In  studying  the  relationship  of  two  indicators  of  reading 
achievement,  they  found  that  while  the  two  sources  of  data  yielded  relatively 
close  agreement  for  high-achieving  students,  there  was  a clear  discrepancy 
between  the  two  data  sources  for  the  lower  achieving  students  (p.  1).  Because 
the  Chapter  I program  specifically  targets  students  in  lower  socioeconomic 
communities  who  score  poorly  on  standardized  tests,  standardized  test  scores 
provide  questionable  data  for  the  very  students  Chapter  I aims  to  serve.  The 
authors  suggested  that  if  social  consequence  is  a key  validity  criterion,  then 
classroom evidence of student achievement needs to be legitimated.

The alternative to standardized tests is commonly referred to as
performance assessment. The term performance assessment, as defined in
Linn  (1993a),  generally  refers  to  assessment  tasks  that  require  students  to 
perform  an  activity  or  construct  a response.  Extended  periods  of  time,  ranging 
from  several  minutes  to  several  weeks,  may  be  needed  to  perform  the  tasks. 
Task  performance  is  often  itself  a beneficial  part  of  instruction.  Miller  and  Legg 
(1993)  agreed  that  while  performance  assessments  are  susceptible  to  many  of 
the  shortcomings  of  traditional  assessments,  they  may  provide  a “more  holistic 
approach  to  assessment”  (p.  12)  and  that  test  preparation  activities  would  result 
in  learning  the  underlying  skills  and  concepts  rather  than  just  how  to  take  the 
test.  These  characteristics  have  generated  substantial  levels  of  acceptance  for 
these  forms  of  measurement  within  the  ranks  of  teachers  and  school 
administrators  (e.g.,  Wiggins,  1992).  At  the  classroom  level,  performance 
assessments  are  most  likely  to  be  created  as  portfolio  assessments.  Research 
in the area of portfolio assessment has increased in recent years. For example,
Moss et al. (1992) investigated the use of writing portfolios for general
accountability  purposes  at  the  eighth-grade  level.  They  proposed  an 
assessment  model  in  which  “teachers  and  students  are  encouraged  to  make 
intellectual  and  creative  choices  that  reflect  their  own  goals  and  interests  and  in 
which  teachers’  interpretations  . . . play  the  central  role”  (p.  14).  They  pointed 
out  that  portfolios,  like  any  other  assessment,  must  be  concerned  with  the 
issues  of  reliability  and  validity  but  that  these  can  be  established  with  further 
research,  including  triangulation  across  data  sources  (p.  14).  They  concluded 
that portfolios provide “an enhanced quality of information that includes an
integrative interpretation of the achievement and growth reflected in the student
work  based  upon  an  intimate  knowledge  of  the  learning  context”  (p.  14). 

Performance  assessments,  however,  may  not  be  a panacea  for  the 
perceived  problems  of  MC-NRTs.  In  subsequent  sections,  the  issues 
surrounding  validation  of  performance  assessments  and  empirical  studies  of 
their  perceived  credibility  and  their  validity  are  reviewed.  Mehrens  (1992),  in 
discussing  the  use  of  performance  assessments  for  accountability  purposes, 
advised that “we must be prudent in our charges regarding the ills of
multiple-choice  tests  and  our  claims  about  the  wonders  of  performance 
assessment”  (p.  5).  The  increased  cost  (both  in  time  required  for  administration 
and  scoring  and  money  to  develop  performance  assessments)  and  the 
subjectivity  involved  in  such  assessments  must  be  carefully  weighed  against 
the  cost  effectiveness,  objectivity,  and  minimal  time  requirements  for 
administration  and  scoring  of  MC-NRTs. 

The  Validation  of  Performance  Assessments 

The  issue  of  how  performance  assessments  should  be  validated  is 
complicated  by  the  fact  that  validity  theory  itself  is  being  recreated,  largely  due 
to  the  influence  of  Messick’s  1989  seminal  work  (Moss,  1992).  The  term  validity 
pertains  to  the  degree  to  which  empirical  evidence  and  theoretical  rationales 
support  the  adequacy  and  appropriateness  of  interpretations  and  actions  based 
on  test  scores  or  other  modes  of  assessment  (Messick,  1989).  In  studying  the 
validity  of  assessment  instruments,  the  traditional  tripartite  scheme  composed  of 
construct-,  content-,  and  criterion-related  evidence  has  been  the  most  widely 
used. As discussed in Moss (1992), construct-related evidence focuses
primarily on the test score as a measure of the psychological characteristic of
interest.  Content-related  evidence  demonstrates  the  degree  to  which  the 
sample  of  items,  tasks,  or  questions  on  a test  are  representative  of  some 
defined  universe  or  domain  of  content.  Criterion-related  evidence 
demonstrates  that  test  scores  are  systematically  related  to  one  or  more  outcome 
criteria  (from  the  1985  Standards  for  Educational  and  Psychological  Testing,  as 
cited  in  Moss,  1992,  pp.  9-11).  Moss  followed  Messick  (1989)  in  stressing  the 
centrality  of  construct  validation  in  the  validity  process  and  that  the  validation 
process  for  performance  assessments  becomes  problematic  when  traditional 
approaches  are  used. 

One  set  of  criteria  for  judging  the  quality  of  performance-based 
assessments  was  suggested  by  Linn,  Baker,  and  Dunbar  (1991)  who  also 
supported  the  notion  that  traditional  forms  of  validation  were  not  sufficient  for 
use  with  performance-based  assessments.  They  suggested  eight  criteria  for 
validation,  including  (a)  evidence  regarding  the  intended  and  unintended 
consequences,  (b)  the  degree  to  which  the  performance  on  specific  assessment 
tasks  transfers,  (c)  the  fairness  of  the  assessments,  (d)  cognitive  complexity  of 
the  processes  students  will  use  in  solving  assessment  problems,  (e) 
meaningfulness  of  the  problems  for  teachers  and  students,  (f)  a basis  for 
judging  the  content  quality,  (g)  a basis  for  judging  comprehensiveness  of 
content  coverage,  and  (h)  cost  of  the  assessment.  Similarly,  Miller  and  Legg 
(1993)  also  cautioned  against  the  use  of  traditional  approaches  to  validation  for 
performance  assessment,  particularly  when  the  results  of  such  assessments  will 
be used for accountability purposes. They called for additional research on the
topic, particularly in the area of the differential validity of performance
assessments  from  traditional  assessments. 

Recently,  Messick  (1995)  set  forth  a list  of  those  aspects  of  validation 
which he deemed most relevant to performance assessment. The six aspects of
validation  are  as  follows: 

1. The content aspect of construct validity includes evidence
of  content  relevance,  representativeness,  and  technical 
quality. 

2.  The  substantive  aspect  refers  to  theoretical  rationales  for  the 
observed  consistencies  in  test  responses,  including  process 
models  of  task  performance,  along  with  empirical  evidence  that 
the  theoretical  processes  are  actually  engaged  by  respondents 
in  the  assessment  tasks. 

3.  The  structural  aspect  appraises  the  fidelity  of  the  scoring 
structure  to  the  structure  of  the  construct  domain  at  issue. 

4.  The  generalizability  aspect  examines  the  extent  to  which  score 
properties  and  interpretations  generalize  to  and  across 
population  groups,  settings,  and  tasks,  including  validity 
generalization  of  test-criterion  relationships. 

5.  The  external  aspect  includes  convergent  and  discriminant 
evidence  from  multitrait-multimethod  comparisons,  as  well  as 
evidence  of  criterion  relevance  and  applied  utility. 

6.  The  consequential  aspect  appraises  the  value  implication  of 
score  interpretation  as  a basis  for  action  as  well  as  the  actual 
and  potential  consequences  of  test  use,  especially  in  regard  to 
sources  of  invalidity  related  to  issues  of  bias,  fairness,  and 
distributive justice. (pp. 3-4)

The  fifth  aspect  of  Messick’s  framework  is  based  upon  examining  the 
relationships  among  multiple  sources  of  data  which  are  purported  to  measure 
the  same  construct.  Such  use  of  multiple  measures  is  called  triangulation.  As 
Webb, Campbell, Schwartz, and Sechrest (1966, as cited in Mathison, 1988)
noted,

Once a proposition has been confirmed by two or more
independent measurement processes, the uncertainty of its
interpretation is greatly reduced. The most persuasive evidence
comes through a triangulation of measurement processes. (p. 3)

Hambleton and Murphy (1992), as well as Linn (1989), supported the use of
multiple methods of assessment in validation of performance assessments.

Construct validity traditionally has been concerned with the relationships
of test scores to external variables. The conceptualization of convergent and
discriminant validity coefficients proposed by Campbell and Fiske (1959, 1967)
provides a means for addressing the requirement that test scores should
strongly relate to some conceptually similar external variables and relate less
strongly to other conceptually different variables. Convergent validity
coefficients are described as monotrait, heteromethod coefficients, that is,
correlations between two measures of the same trait obtained by different
methods. Discriminant validity coefficients are heterotrait, monomethod
coefficients or heterotrait, heteromethod coefficients. Theoretically and
empirically, these latter types of coefficients should be lower than convergent
validity coefficients to claim evidence of validity of the measures. In applications
of this approach to validation, validity coefficients must be disattenuated for the
effects of different reliabilities of the measurement methods. In using this
validation approach with performance assessments for Title I programs, the
traits of interest would be achievement in reading, mathematics, and language
arts (writing). The methods most commonly of interest would be portfolio
assessments, MC-NRTs, and standardized, on-demand performance
assessment tests, such as the Florida Writing Assessment.
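
The correction for attenuation described above can be illustrated with a brief sketch. The Python function below applies the classical formula, in which an observed correlation is divided by the square root of the product of the two reliabilities. The function name and all numeric values are hypothetical, chosen only for illustration; they are not results from any study cited here.

```python
import math


def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical values for illustration only (not taken from any cited study).
# Convergent coefficient (monotrait, heteromethod):
# reading rated from a portfolio vs. reading on an MC-NRT.
convergent = disattenuate(r_xy=0.56, rel_x=0.80, rel_y=0.90)

# Discriminant coefficient (heterotrait, heteromethod):
# reading rated from a portfolio vs. mathematics on an MC-NRT.
discriminant = disattenuate(r_xy=0.36, rel_x=0.80, rel_y=0.90)

# Evidence of validity requires the convergent coefficient to be the larger one.
print(round(convergent, 2), round(discriminant, 2))
print(convergent > discriminant)
```

Because both coefficients in this sketch are corrected with the same pair of reliabilities, their ordering is unchanged; the correction matters most when the methods being compared differ in reliability.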

Messick (1994) examined the issue of the construct representation of
performance assessments, emphasizing the concepts of authenticity and
directness of performance assessments. According to Messick, these two
concepts may nullify the two major threats to construct validity, namely,
construct underrepresentation and construct-irrelevant variance, respectively.
Together, authenticity and directness “signal the need for convergent and
discriminant evidence that the test is neither unduly narrow because of missing
construct variance nor unduly broad because of added method variance” (p.
22). These concepts take on special meaning in the arena of performance
assessment, as issues pertaining to item and task contexts and the
generalizability of test results are beginning to be scrutinized. The overarching
issue pertains to the contextualization of items and tasks:

There is no necessary assumption that a skill takes the same form
in all contexts. What is important is not that the skill appears
different in different contexts, but that it changes nonrandomly with
conditions and hence correlates with construct-relevant variables.
(p. 18)

Because  interactions  with  context  are  inevitable,  Messick  offered  a number  of 
possible  approaches  to  compensating  for  them  in  measurement,  including 
attempting  to  strip  the  problem  context  of  irrelevancies,  attempting  to  draw 
inferences  from  consistencies  in  behavior  across  contexts  (or  across  varied 
tasks  within  context),  modeling  the  skill  as  a function  of  variation  in  task  difficulty 
and  contextual  influence,  and  treating  skills  revealed  in  different  contexts  as 
qualitatively  different  skills  (although  he  did  not  advocate  this  latter  approach). 
Due to the infinite number of skill-context combinations, Messick suggested that
the emphasis should not be on contextualized versus decontextualized
assessment but, rather, on contextualized versus cross-contextual assessment.
In  using  multiple  and  varied  contextualized  tasks,  student  interest  (e.g., 
motivation  and  effort)  could  be  maintained  while  allowing  a cross-contextual 
approach  to  validation  to  be  taken.  However,  inherent  in  this  approach  are 
questions  of  contextual  bias: 

We  should  not  take  for  granted  that  a richly  contextualized 
assessment  task  is  uniformly  good  for  all  students  . . . contextual 
features  that  engage  and  motivate  one  student  and  facilitate  his  or 
her  effective  task  performance  may  alienate  and  confuse  another 
student and bias or distort task performance. (p. 19)

To cope with differential student responsiveness to context, Messick suggested
that an aggregate measure of the construct across a variety of item contexts be
developed (p. 19). An added benefit of the construct-related approach to
performance assessment is that the meaning of the construct to be assessed
also provides a rational basis for hypothesizing potential testing outcomes and
for anticipating possible side effects (e.g., increased adverse impact for gender
and racial/ethnic groups) (p. 22).

Teachers’  Perceptions  of  the  Implementation 
of  Performance  Assessments 

In  1991-92,  the  state  of  Vermont  initiated  a statewide  writing  and 
mathematics  portfolio  system  in  grades  4 and  8.  Vermont  was  the  first  state  to 
make  portfolios  the  focal  point  of  a statewide  assessment  system,  and  the 
state’s  innovative  efforts  attracted  nationwide  attention.  During  the  second  year 
of  portfolio  implementation,  Stecher  and  Hamilton  (1994)  surveyed  teachers  in 
grades 4 and 8 regarding the use of the statewide mathematics portfolio system.
They received 519 completed surveys, three-fourths from grade 4 and
one-fourth  from  grade  8,  a sample  which  represented  52%  of  all  Vermont 
teachers  who  taught  mathematics  in  grade  4 and  41%  of  mathematics  teachers 
in  grade  8.  Their  analyses  focused  on  the  implementation;  impact  on 
curriculum,  instruction,  and  student  achievement;  and  the  burdens  of  the 
portfolio  system. 

In  terms  of  implementation,  the  majority  of  teachers  believed  that  the 
portfolio  system  was  not  being  uniformly  implemented  across  classrooms  and 
schools.  Policies  pertaining  to  the  amount  of  revising  done  to  best  pieces  of 
work  as  well  as  who  could  help  the  students  with  their  revisions  and  what  role 
the teacher played in the selection of pieces varied greatly. However, many
similarities among portfolio implementation practices were also cited: teachers
were using portfolios with the vast majority of their students, and the emphasis
placed upon different aspects of best pieces for inclusion in the portfolio was
found to be consistent.

In  terms  of  changes  in  curriculum  and  instruction,  most  teachers  reported 
substantial  changes  in  curriculum  focus  and  teaching  methods  since 
implementing  the  portfolio  system.  Curriculum  changes  were  found  to  be  most 
evident  in  the  areas  of  problem  solving  and  mathematical  communication.  In 
terms  of  impact  on  student  achievement,  one-half  of  the  teachers  reported  that 
students  were  learning  mathematics  better  because  of  the  portfolio  system.  The 
primary  sources  of  improvement  were  in  students’  thinking  and  reasoning  about 
math. Although the researchers reported overall support for the portfolio
system, teachers voiced concern about the time burdens involved in
implementing the portfolio program.

Leitner  and  Trevisan  (1993)  also  surveyed  teachers  in  order  to  assess 
their  perceptions  of  the  effects  of  using  performance  portfolios  for  accountability 
purposes.  Teachers  of  grades  K-8  in  three  schools  targeted  for  Title  I program 
improvement were included. Thirty-one teachers participated in the study,
which used the Stages of Concern questionnaire, Levels of Use interviews, and
site observations. Data from the Stages of Concern questionnaire indicated
that  most  of  the  teachers  were  concerned  with  management  issues  such  as 
planning  time,  the  need  for  staff  development,  and  scoring  rubric  development. 
The  data  from  the  Levels  of  Use  interviews  indicated  that  the  staff  were  at  Level 
III,  mechanical  use,  suggesting  that  teachers  were  using  portfolios,  but  they 
were  “not  using  all  the  components  and  are  using  them  in  a stepwise  fashion, 
resulting  in  disjointed  and  superficial  use”  (p.  9).  Data  collected  during  site 
observations  and  meetings  indicated  that  teachers  had  concerns  regarding 
portfolio  implementation  in  the  following  areas:  (a)  district  support,  (b)  how 
portfolios  related  to  report  card  grades,  (c)  time  required  for  implementation,  (d) 
parental  support,  and  (e)  impact  on  the  instructional  program.  Teachers  also 
reported  gaining  a more  balanced  view  of  assessment  from  implementing  the 
portfolio  program. 

Leitner  and  Trevisan  concluded  that  as  the  development  of  portfolio 
systems  progresses,  the  need  for  validation  will  become  critical.  Furthermore, 
these  findings  underscore  the  need  for  comprehensive  approaches  to 
validating complex performance assessments using pragmatic criteria such as those
proposed by Linn, Baker, and Dunbar (1991), particularly the criteria of cost and
efficiency and generalizability. Messick’s (1995) criteria also seem relevant in
light  of  these  findings. 

Moss  et  al.  (1992)  used  a case-study  approach  to  investigate  the  use  of 
writing  portfolios  for  general  accountability  purposes  at  the  eighth-grade  level. 
They  proposed  an  assessment  model  in  which  “teachers  and  students  are 
encouraged  to  make  intellectual  and  creative  choices  that  reflect  their  own 
goals  and  interests  and  in  which  teachers’  interpretations  . . . play  the  central 
role”  (p.  14).  They  concluded  that  portfolios  provide  “an  enhanced  quality  of 
information  that  includes  an  integrative  interpretation  of  the  achievement  and 
growth  reflected  in  the  student  work  based  upon  an  intimate  knowledge  of  the 
learning  context”  and  supported  the  view  that  portfolios,  like  any  other 
assessment,  must  be  concerned  with  the  issues  of  reliability  and  validity  but  that 
these  can  be  established  with  further  research,  including  triangulation  across 
data  sources  (p.  14). 

LeMahieu, Gitomer, and Eresh (1995) investigated the use of a district-wide
writing portfolio assessment and teachers’ perceptions of the system. The
system,  including  the  rubric,  was  derived  by  many  teachers  and  administrators 
“repeatedly  examining  student  work  and  developing  a vocabulary  that  people 
could  use  to  discuss  that  work”  (p.  25).  This  shared  development  had  a major 
impact  on  the  perceptions  of  teachers  and  how  they  used  portfolios.  LeMahieu 
et  al.  believed  that  assessment  should  be  viewed  not  as  an  added  task  in  the 
instructional  process  but  rather  “as  a central  component  of  teaching  and 
learning” (p. 26). Especially significant was the fact that although results from
the portfolio assessment indicated that the students were not meeting the
expectations  of  the  writing  program,  teachers  were  nonetheless  accepting  of  the 
portfolio  system.  This  acceptance  was  attributed  largely  to  the  fact  that 
portfolios  were  directly  relevant  to  classroom  experiences  and  that  their 
contents  were  directly  connected  to  instructional  events  (p.  27). 

Parent  Perceptions  of  Standardized  and  Performance  Assessments 

Based  upon  the  findings  of  previous  studies  (e.g.,  Leitner  & Trevisan, 
1993),  Shepard  and  Bliem  (1995)  studied  parents’  perceptions  about 
standardized  tests  and  performance  assessments.  Data  were  collected  by 
means  of  questionnaire  surveys  and  extended  interviews  of  33  parents  of  third 
graders.  They  found  that,  while  the  majority  of  parents  approved  of  both  types 
of  measures,  performance  assessments  had  a higher  approval  rating  than  did 
standardized  tests  (p.  31).  While  a majority  of  parents  (60%)  felt  that  seeing 
graded  samples  of  their  child’s  work  was  very  useful  in  learning  about  their 
child’s  progress  in  school,  standardized  test  scores  were  seen  as  very  useful 
only 14% of the time. Parents’ favoritism toward performance tasks appeared to
be based on their belief that the performance tasks make children
think and that they encourage children to use their imagination. Key features of
parents’ perceptions of standardized tests were that children could easily guess
the answers and that, while the items were objective, they were often too easy.
Analysis  of  interview  data  revealed  that  parents  “seemed  consistently  to  trust 
teachers  and  to  have  confidence  in  teachers’  professional  judgment”  (p.  27). 
Furthermore,  while  many  parents  expressed  a need  for  some  type  of  normative 
data, they also felt that this need could be met by their being informed by the
teacher how their child was doing in relation to grade-level expectations. This
suggests  that  parents  would  value  even  locally  developed  benchmarks  or 
performance  standards. 

Empirical  Studies  of  the  Validity  and  Reliability  of  Performance  Assessments 

Thus  far  only  a handful  of  studies  have  been  conducted  in  which 
researchers  attempted  to  determine  the  extent  to  which  students  display  similar 
levels  of  performance  across  the  various  methods  of  measurement. 
Methodologies  and  results  of  these  studies  are  summarized  in  Tables  2 and  3. 
For  each  study,  the  grade  level,  sample  size  and  selection  method,  content 
area,  assessment  method(s),  interrater  reliability,  task  reliability,  the  unit  of 
analysis, and results are reported. A more detailed description of the studies
included in Tables 2 and 3 may be found in Appendix A.

Of the studies summarized in the tables, only Buttram and McCann (1993)
focused exclusively on the student population served by Title I. While several
researchers focused on multiple areas of achievement (e.g., Burger & Burger,
1994;  Buttram  & McCann,  1993;  Ercikan  & Schwarz,  1995;  Koretz,  Klein, 
McCaffrey,  & Stecher,  1993;  Stevens  & Clauser,  1995),  none  included  the 
simultaneous  analysis  of  achievement  data  in  reading,  writing/language,  and 
math.  Furthermore,  only  two  of  the  studies  involved  the  use  of  portfolio 
assessments  (Koretz  et  al.,  1993;  LeMahieu  et  al.,  1995).  While  both  of  these 
studies  demonstrated  reasonable  reliabilities  for  their  portfolio  assessments,  the 
former focused only on the areas of writing and math while the latter contained
analyses in the area of writing only, and in both cases the samples cut across
grade levels (e.g., elementary and middle school or middle school and high
school). Therefore, the results of the two studies using portfolio assessments
seem unlikely to be generalizable to the subject areas or homogeneous types of
samples that are the focus of Title I programs. Thus, no research to date has
related district portfolio assessments to standardized tests and demand-
performance tasks for Title I students. Particularly noteworthy is the lack of
research that is all-inclusive of the areas of reading, math, and writing, which are
the subject areas of major focus in many federal and state program evaluations.

Table 2 (continued)

Study                    Gentile (1992)
Grade (N)                4 (653); 8 (899)
Content area             Writing
Assessment methods       NAEP (Writing); School-based Writing
Interrater reliability   .76-1.00
Task reliability         --
Unit of analysis         Student
Results                  Percentage of agreement between assessment
                         methods was 77% in grade 4 and 55% in grade 8.

Table 3

Sampling Methods for Empirical Studies of Performance Assessments

Study                      Sample Selection Method

Baxter et al. (1992)       2 schools within 2 districts

Burger & Burger (1994)     School district in a small city in the western
                           United States

Buttram & McCann (1993)    4 elementary schools from a large urban district
                           in the northeast United States

Ercikan & Schwarz (1995)   Random sample of students across a state

Gentile (1992)             Nationally representative group

Koretz et al. (1993)       Random sample of 3-5 portfolios chosen from each
                           participating school in the state

LeMahieu et al. (1995)     Stratified random sample of classes within a
                           district

Shepard et al. (1996)      13 3rd-grade classrooms in 3 schools and 3
                           additional control schools

Stevens & Clauser (1995)   Random sample of students in a large suburban
                           district in the southwest United States

Summary 

A number  of  researchers  have  expressed  concerns  about  the  use  of 
performance  and  portfolio  assessments  for  accountability  purposes  (Jenkins, 
1993; Ligon, 1993; Mehrens, 1992; Miller & Legg, 1993). Findings are mixed
as to whether the use of such measures is a valid and reliable practice
(Baxter,  Shavelson,  & Pine,  1992;  Burger  & Burger,  1994;  Buttram  & McCann, 
1993;  Gentile,  1992;  Koretz  et  al.,  1993;  LeMahieu  et  al.,  1995;  Moss  et  al., 
1992).  A growing  body  of  literature  suggests  that  performance  assessments 
and  traditional  paper  and  pencil  assessments  may  not  be  measuring  a common 
underlying  construct  (Ercikan  & Schwarz,  1995;  Stevens  & Clauser,  1995). 
Typically,  these  studies  focus  on  how  achievement  in  a given  subject  area 
correlates  across  different  types  of  assessments  such  as  the  traditional 
standardized  norm-referenced  achievement  tests  and  newer  forms  of 
assessment  such  as  portfolio  and  performance  assessment. 

Rising  interest  in  improving  the  “relevance”  of  these  new  forms  of 
assessment  has  led  to  the  development  of  a large  number  of  alternative 
assessment programs. As Messick (1994, 1995) has pointed out, however,
claims of relevance and authenticity are best viewed as validity arguments that
must  be  evaluated  empirically.  One  method  for  conducting  such  evaluations  is 
the  currently  proposed  study  in  which  alternative  assessment  approaches  are 
simultaneously  compared  to  traditional  assessments  for  the  same  examinees 
over  three  academic  domains. 

The  present  study  has  included  data  collected  during  the  course  of  a 
school  year  from  the  academic  areas  of  reading,  writing,  language  arts,  and 
mathematics,  as  well  as  a demand-performance  task  and  a standardized 
norm-referenced achievement test. The focus of the present research is the
examination of convergent and discriminant validity of these varying forms of
assessment,  in  the  context  of  Title  I program  evaluation  at  the  school-district 
level. 

In  the  present  research,  data  from  a portfolio-based  measure  as  well  as  a 
standardized,  norm-referenced  achievement  test  and  state  writing  performance 
assessment  data  were  obtained  from  a middle-sized  school  district  with  a 
student  population  of  approximately  40,000.  The  present  study  focused  on 
grades 3 through 5 in seven Title I schools. The present research was
designed to overcome two major limitations noted for previous studies, namely the
limited grade levels that have been studied and the lack of research
that is all-inclusive of the areas of reading, math, and writing, and to offer a
broadened perspective on a system that allows for the use of
classroom-based assessments for the purposes of external accountability.
Such a system, as Nitko (1995), Feuer (1994), and others have contended,
would work in concert with curricular reforms and serve as a catalyst for
improved instruction.


CHAPTER  3 
METHOD 


Purpose 

The purpose of this study was to investigate the construct validity of a
portfolio-based assessment in relation to an MC-NRT and a writing performance
assessment  in  the  evaluation  of  a Title  I program  in  a middle-sized  district  in 
west-central  Florida.  The  study  focused  on  the  relationship  among  the  three 
instruments  used  in  seven  Title  I schoolwide  programs.  A schoolwide  program 
is  one  in  which  Title  I funds  are  used  to  upgrade  the  entire  educational  program 
in  a school  according  to  an  individual  school  plan.  During  the  1995-96  school 
year,  in  order  for  a school  to  be  eligible  for  schoolwide  program  status,  at  least 
60%  of  its  students  must  have  been  eligible  for  a free  or  reduced-price  lunch. 
Messick's  (1989,  1995)  themes  of  generalizability,  external,  and  consequential 
aspects  of  validity  were  used  to  guide  the  design  of  the  study. 

Context  of  the  Study 

National  Context 

This  analysis  occurred  at  a time  when  the  Title  I program  was 
implementing  changes  as  required  by  the  1994  reauthorization.  The  new 
legislation  deemphasizes  census  standardized  achievement  testing  and 
promotes  the  development  of  performance  standards  and  alternative 
assessment measures. This recent shift in emphasis towards
performance-based measures is found not only in the evaluation of Title I
programs but also in the broader arena of educational accountability.

With  the  1994  reauthorization,  the  emphasis  of  Title  I evaluation  shifted 
from  federally  mandating  the  use  of  Normal  Curve  Equivalent  (NCE)  gain 
scores  on  norm-referenced  tests  for  students  in  grades  2 and  above  to  allowing 
states  flexibility  in  designing  their  system  of  accountability  based  on  individual 
state  accountability  plans.  Title  I required  that  the  progress  of  students  in 
meeting  challenging  standards  be  measured  at  the  grade  levels  contained  in 
each  state  plan,  at  a minimum  of  one  time  between  grades  3 through  5,  6 
through  8,  and  10  through  12. 
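
For readers unfamiliar with the NCE metric, the following sketch shows how a percentile rank maps onto the NCE scale and how a gain score is formed. It assumes the conventional definition, NCE = 50 + 21.06z, where z is the normal deviate corresponding to the percentile rank; the function names and the example percentiles are hypothetical and are not drawn from the district's data.

```python
from statistics import NormalDist

NCE_SCALE = 21.06  # standard deviation of the NCE scale (conventional value)


def percentile_to_nce(percentile_rank: float) -> float:
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    if not 0.0 < percentile_rank < 100.0:
        raise ValueError("percentile rank must be strictly between 0 and 100")
    z = NormalDist().inv_cdf(percentile_rank / 100.0)  # standard normal deviate
    return 50.0 + NCE_SCALE * z


def nce_gain(pre_percentile: float, post_percentile: float) -> float:
    """NCE gain score: posttest NCE minus pretest NCE."""
    return percentile_to_nce(post_percentile) - percentile_to_nce(pre_percentile)


# Hypothetical student who moves from the 30th to the 40th percentile.
print(round(percentile_to_nce(30), 1), round(percentile_to_nce(40), 1))
print(round(nce_gain(30, 40), 1))
```

By construction, NCEs of 1, 50, and 99 coincide with percentile ranks of 1, 50, and 99, but unlike percentile ranks the NCE scale is treated as equal-interval, which is why gains were computed on it.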

State  Context 

Because  Florida’s  Blueprint  2000  legislation  required  assessment  data 
to  be  collected  for  accountability  purposes  for  students  in  grades  4,  8,  and  10, 
the  focus  of  the  Title  I evaluation  in  the  state  of  Florida  became  these  same 
grade  levels.  The  focal  point  of  the  evaluation  became  the  attainment  of 
proficiency,  which  at  the  elementary  level  was  defined  as  follows: 

1. On the district’s norm-referenced standardized achievement test, at
least  33%  of  the  students  scoring  above  the  50th  percentile  in  the  areas  of 
reading  comprehension  and  mathematics  concepts  and  applications. 

2.  On  the  Florida  Writing  Assessment,  at  least  33%  of  the  students 
scoring  at  or  above  3.0  on  a 6-point  scale. 

These  levels  of  proficiency  were  also  used  to  identify  “Critically  Low  Performing 
Schools”  and  “Performance  Improvement  Schools”  throughout  the  state  of 
Florida. Although schools identified as “Critically Low” are not necessarily Title I
schools, and many Title I schools are not in the low category, there is
substantial overlap (Florida Department of Education, 1996). This heightens
district concerns about the effectiveness of its Title I programs in producing
favorable results on these indicators and underscores the issue of whether
portfolio assessments yield results that diverge from, rather than complement,
those of these other measures.
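
The two elementary-level proficiency criteria defined above can be expressed as a simple school-level check. The sketch below is illustrative only: the function names are hypothetical, the score lists are invented, and the norm-referenced criterion would be applied separately to reading comprehension and to mathematics concepts and applications.

```python
from typing import Sequence

MIN_SHARE = 0.33          # at least 33% of students must meet the cut
NRT_CUT_PERCENTILE = 50   # students must score ABOVE the 50th percentile
WRITING_CUT_SCORE = 3.0   # students must score AT OR ABOVE 3.0 (6-point scale)


def share_above(scores: Sequence[float], cut: float) -> float:
    """Proportion of scores strictly above the cut."""
    if not scores:
        raise ValueError("no scores supplied")
    return sum(s > cut for s in scores) / len(scores)


def share_at_or_above(scores: Sequence[float], cut: float) -> float:
    """Proportion of scores at or above the cut."""
    if not scores:
        raise ValueError("no scores supplied")
    return sum(s >= cut for s in scores) / len(scores)


def meets_nrt_criterion(percentiles: Sequence[float]) -> bool:
    """Criterion 1, applied to one subtest (e.g., reading comprehension)."""
    return share_above(percentiles, NRT_CUT_PERCENTILE) >= MIN_SHARE


def meets_writing_criterion(scores: Sequence[float]) -> bool:
    """Criterion 2, applied to Florida Writing Assessment scores."""
    return share_at_or_above(scores, WRITING_CUT_SCORE) >= MIN_SHARE


# Hypothetical school-level data, for illustration only.
reading_percentiles = [22, 35, 48, 51, 67, 73, 40, 18, 55, 30]
writing_scores = [2.0, 2.5, 3.0, 2.5, 2.0, 2.5, 1.5, 2.0, 3.5, 2.5]

print(meets_nrt_criterion(reading_percentiles))
print(meets_writing_criterion(writing_scores))
```

Note that the two criteria use different comparisons (strictly above the 50th percentile versus at or above 3.0), so the sketch keeps them as separate helper functions rather than a single threshold test.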

District  Context 

Students,  teachers,  and  reading  specialists  from  a middle-sized 
suburban/rural  school  district  in  central  Florida  provided  the  data  for  this  study. 
This  school  district  provided  a unique  opportunity  for  research  on  this  topic,  as  it 
has  completed  its  sixth  year  of  implementing  a portfolio  program  at  the 
elementary  level,  with  research  conducted  on  the  validity  and  reliability  of  its 
portfolio  element  data  collection  procedures  (Hall  & Homan,  1995;  Homan, 
1994).  Furthermore,  the  Title  I program  has  summarized  information  contained 
in  the  student  portfolios  using  the  Title  I Student  Outcomes  Instrument  (SOI)  and 
reported  these  data  to  the  State  Department  of  Education  for  the  purposes  of 
Title  I program  evaluation  since  the  1990-91  school  year.  In  addition  to  portfolio 
records,  student  achievement  data  for  elementary  school  students  in  grades  3 
through  5 were  also  collected  using  the  Stanford  Achievement  Test,  8th  Edition, 
and  for  students  in  grade  4,  the  performance-based  Florida  Writing  Assessment. 
The  availability  of  these  types  of  data  provided  a unique  opportunity  to  assess 
student  achievement  across  the  three  types  of  measurement  methods  and  to 
examine  the  effects  of  measurement  method  on  conclusions  about  (a) 
performance of individual students and (b) overall Title I program effectiveness.
LeMahieu et al. (1995) have advocated the use of portfolio assessment at the
district  level,  noting  that  information  from  these  assessments  can  play  a vital 
role  in  influencing  instructional  practices. 

Sample 

For  this  study,  the  researcher  obtained  data  for  students  in  grades  3 
through  5 who  were  enrolled  in  seven  Title  I Schoolwide  programs  in  a 
middle-sized  suburban/rural  district  in  west-central  Florida  during  the  1995-96 
school year. The data from approximately 1,700 students who had both SAT
scores and SOI scores as well as Florida Writing Assessment
scores (at 4th grade only) were used in the study (Table 4).

Table 4

Number of Students in Sample by Grade

Grade           N       N of Classrooms    N of Students Per Class

3               536     60                 1-29
4               559     55                 1-31
5               647     62                 1-33

Total Sample    1,742   103                1-33

Data Collection Design

In  this  correlational  study,  data  on  the  Florida  Writing  Assessment,  the 
Stanford  Achievement  Test,  and  the  Student  Outcomes  Instrument  were 
included  in  the  analyses.  All  three  sources  of  achievement  data  were  collected 
during  the  spring  of  1996.  Fourth-grade  students  were  assessed  on  the  Florida 
Writing  Assessment  in  late  January.  Grades  3 through  5 students  were 
assessed  on  the  Stanford  Achievement  Test  in  mid-April.  Student  Outcomes 
Instrument  data,  based  upon  recent  work  samples  in  the  student  portfolio,  were 
collected  during  the  month  of  May. 

Instrumentation— Student  Outcomes  Instrument 
The  SOI  is  a summary  instrument  that  enables  teachers  to  provide 
ratings  of  individual  student  proficiency  in  each  of  the  academic  areas  of 
reading,  writing,  and  mathematics.  Within  each  academic  area,  several 
different  performance  features  are  rated  after  reviewing  a student’s  portfolio. 

The  development  of  the  SOI  began  during  the  1990-91  school  year,  with  the 
assistance  of  district-level  curriculum  and  assessment  staff,  teachers,  reading 
specialists,  guidance  counselors,  and  school-based  administrators.  The 
features  that  are  rated  on  the  SOI  were  selected  to  be  congruent  with  the 
district’s curriculum frameworks and student outcomes in each academic area.
The  system  was  revised  for  the  1995-96  school  year  to  conform  to  the 
requirements  of  the  1994  reauthorization  of  Title  I.  To  this  end,  groups  of 
content  area  experts  as  well  as  psychometricians  from  the  Educational  Testing 
Service’s  Regional  Technical  Assistance  Center  participated  in  the  process  of 
revising the instrument. The process used in the development of the SOI was
similar to the one used in defining quality work in the reading domain and how
to measure it (Green, Hensley, Jorgensen, McDevitt, & Wolfe, 1995). Of primary
consideration  in  the  revision  process  was  that  the  SOI  would  have  sufficient 
psychometric  quality  that  the  results  could  be  used  for  high  stakes  annual 
evaluation  of  the  Title  I program,  while  allowing  flexibility  in  the  specific  pieces 
included  in  the  portfolio  (a  topic  that  was  studied  in-depth  by  LeMahieu  et  al., 
1995). 

Scoring  Features  of  the  SOI 

There  are  two  forms  of  the  SOI,  one  for  the  primary  grades  (1-2)  and  a 
second  for  the  intermediate  grades  (3-5).  The  forms  are  presented  in  Figures  2 
and  3.  In  the  reading  domain,  the  features  are  derived  from  reader  response 
theory  in  which  reading,  as  well  as  other  forms  of  communication,  is  viewed  as 
a process  of  constructing  and  extending  meaning  through  complex  contextual 
interactions  (Langer,  1990). 

Writing  is  viewed  as  a process  of  constructing,  examining,  and  extending 
meaning  for  a variety  of  audiences.  In  the  writing  domain,  the  portfolio  ratings 
are  made  on  five  process  features  at  the  primary  grades:  the  support  of  the 
topic,  the  use  of  appropriate  and  expressive  language,  organization  and 
sequencing,  clarity  of  ideas,  and  the  use  of  standard  conventions.  At  the  upper 
elementary  grades,  four  features  are  rated:  the  use  of  details  and  supporting 
ideas,  organization,  the  focus  on  the  topic,  and  the  use  of  standard  conventions. 
As  with  the  reading  domain,  the  level  of  sophistication  expected  increases  with 
grade level.

PRIMARY (1-2) STUDENT OUTCOMES INSTRUMENT, SPRING 1995-96
General Purpose Data Sheet II, form no. 70921

ID Number = 4-digit School Fish # followed by 6-digit Student ID #
Student Name (Print)
Teacher Name (Print)

Special Codes:
Column A: Grade (1 = Grade 1, 2 = Grade 2)
Columns B-D: Spring Instructional Reading Level
B-C: Emergent/Beginning, 01-19
D: Transitional/Proficient/Maturing use: 1 = Early trans., 2 = Middle trans.,
3 = Late trans., 4 = Early prof., 5 = Middle prof., 6 = Prof., 7 = Maturing

Rating scale: A = Introductory Stage, B = Independent Stage,
C = Advanced Stage, D = Not Taught

READING (Based on Reading Records contained in the Student Portfolio.)
1. Uses appropriate reading strategies.
2. Comprehends written material.

WRITING (Based on Writing Summary Sheet in the Student Portfolio over time.)
3. Supports topic.
4. Uses appropriate and expressive words.
5. Demonstrates a sense of organization and sequencing.
6. Uses clear ideas.
7. Demonstrates an awareness of standard conventions (punctuation, capitalization,
spelling, grammar).

MATHEMATICS (Based on student work samples.)
8. Problem solving
9. Number concepts
10. Number estimation
11. Patterns
12. Spatial sense
13. Measurement
14. Comparing data and graphing
15. Math communication

Reprinted with permission of NCS.
Figure 2. Primary Student Outcomes Instrument.

INTERMEDIATE (3-5) STUDENT OUTCOMES INSTRUMENT, SPRING 95-96
General Purpose Data Sheet, form no. 70921

ID Number = 4-digit School Fish # followed by 6-digit Student ID #
Student Name (Print)
Teacher Name (Print)

Special Codes:
Column A: Grade (3 = Grade 3, 4 = Grade 4, 5 = Grade 5)
Columns B-D: Spring Instructional Reading Level
B-C: Emergent/Beginning, 01-19
D: Transitional/Proficient/Maturing use: 1 = Early trans., 2 = Middle trans.,
3 = Late trans., 4 = Early prof., 5 = Middle prof., 6 = Prof., 7 = Maturing

Rating scale: A = Introductory Stage, B = Independent Stage,
C = Advanced Stage, D = Not Taught

READING (Based on Reading Records contained in the Student Portfolio.)
1. Uses appropriate reading strategies.
2. Comprehends written material.
3. Reads different types of materials according to purpose.

WRITING (Based on Writing Summary Sheet in the Student Portfolio)
4. Uses details and supporting ideas.
5. Demonstrates organization.
6. Focuses on topic.
7. Uses standard conventions (punctuation, capitalization, spelling, grammar).

MATHEMATICS (Based on student work samples.)
8. Problem solving
9. Whole number and fraction concepts and operations
10. Number estimation
11. Patterns
12. Measurement
13. Geometry
14. Collects, organizes, and interprets data
15. Math communication

Reprinted with permission of NCS.

Figure 3. Intermediate Student Outcomes Instrument.

The mathematics domain is defined by four content features and four
process features. These features represent a close adaptation of the widely
known NCTM Curriculum and Evaluation Standards for School Mathematics
(National Council of Teachers of Mathematics, 1989). The content features are
similar to traditional mathematics content objectives (e.g., measurement,
geometry), unlike the process features (e.g., problem solving and
communication). Again, the content of the outcomes and process features
differs slightly for the primary and upper elementary forms.

Based  on  the  student’s  structured  portfolio  records  for  each  academic 
area,  teachers  review  the  student  work  samples  collected  during  the  spring 
(May)  data  collection  period,  along  with  any  summary  sheets.  Then,  the  teacher 
records  the  student’s  level  of  proficiency  on  the  scannable  Student  Outcomes 
Instrument form. The form utilizes a 3-point scale to depict a student’s academic
stage,  including  “Introductory  Stage,”  “Independent  Stage,”  and  “Advanced 
Stage.”  [Note:  The  term  “Independent  Stage”  is  similar  to  “proficient.”  This 
alternative  designation  was  used  in  order  to  eliminate  confusion  as  the  term 
“proficient”  had  been  widely  used  in  the  district  in  the  context  of  reading  level 
designation.]  In  addition,  each  dimension  could  receive  a “Not  Taught”  (NT) 
rating,  signifying  that  the  teacher  did  not  cover  a particular  dimension  within  a 
domain.  Teachers  were  asked  to  keep  the  use  of  the  NT  rating  to  a minimum. 

Teachers  have  received  continual  in-depth  training  in  the  district’s 
curriculum  frameworks  and  student  outcomes.  This  background  knowledge, 
coupled  with  training  in  the  use  of  the  performance  portfolio  system  and  its 
supporting documentation, aids teachers in assigning ratings to each student.
To ensure consistency in interpretation and record keeping, teachers participate
in a training session early in the school year, followed by small group
discussion of the benchmarking concepts. For each academic area, domain
scores  are  calculated  based  upon  the  sum  of  the  individual  item  scores  within  a 
given  academic  domain.  This  SOI  scoring  and  rating  process  is  summarized  in 
Table  5. 

Portfolio  Contents 

The  purposes  for  maintaining  portfolio  records  for  students  include  (a)  to 
communicate  an  evolving  portrait  of  a student’s  growth  in  learning,  (b)  to 
capture  samples  of  the  student’s  performance,  and  (c)  to  provide  a guide  for 
on-going  goal  setting.  The  SOI  portfolio  includes  a table  of  contents,  reading 
portfolio  tasks,  writing  portfolio  tasks,  mathematics  portfolio  tasks,  report  card 
marks,  and  teacher  anecdotal  records  (including  goals  and  reflections  for  the 
year). 

Reading  portfolio  tasks 

The  SOI  reading  assessment  tasks  revolve  around  the  running  record 
and  are  comprised  of  several  activities  that  require  the  student  to  construct, 
examine,  and  extend  meaning  through  reading  activities.  Elements  contained 
in  the  reading  portfolio  include  (a)  table  of  contents,  (b)  running  records,  (c) 
story  retelling  record,  (d)  comprehension  checks,  and  (e)  reading  log.  The  table 
of  contents  provides  a summary  of  the  work  that  is  contained  in  the  portfolio. 
Running records allow the teacher to ascertain the student’s instructional
reading level and are based on “authentic” leveled reading selections. Another
means by which teachers collect information pertaining to a student’s reading
comprehension is through the writing summary. This task involves the student
summarizing a story’s main features in written form. The reading log is a record
of the materials read by the student during the school year. These pieces of
evidence, in the context of the difficulty level of the reading material on which
the student is instructional, provide a framework in which teachers are able to
assign ratings of “introductory,” “independent,” or “advanced” on the SOI form.

Table 5

Student Outcomes Instrument Scoring and Rating Procedures

Reading

1. Teachers administer the Running Record and determine the student’s
instructional reading level, including comprehension and the use of
appropriate decoding strategies.
2. Teachers review the student’s Reading Log, which is used as an indication
of the types of materials the student has read throughout the school year.
3. On the scannable Student Outcomes Instrument form, teachers record the
student’s instructional reading level and assign one of three levels of
proficiency to each item within the reading domain.
4. Aggregate domain scores for the area of reading are calculated, based upon
the individual item scores within the reading domain.

Writing

1. Students, along with teachers, select the year-end writing sample to be
included in the portfolio. Teachers review the sample and record data on the
elements found in the Writing Summary Sheet, by examining the first and final
copies (drafts) of the sample.
2. Teachers review the student’s Title Tally, which contains a list of all titles
published throughout the school year.
3. On the scannable Student Outcomes Instrument form, teachers assign one of
three levels of proficiency to each item within the writing domain, based upon
the Writing Summary Sheet, year-end writing sample, and Title Tally.
4. Aggregate domain scores for the area of writing are calculated, based upon
the individual item scores within the writing domain.

Mathematics

1. Teachers administer assessments in math, collect work samples, and
document classroom behavior during the May data collection period.
2. Teachers review the SOI independent and advanced stage benchmarks.
3. On the SOI form, teachers assign one of three levels of proficiency to each
item within the math domain, based upon assessments, work samples, and
observation in each area.
4. Aggregate domain scores for the area of math are based upon the individual
item scores within the math domain.

The  running  record.  This  entry  is  an  individual  assessment  of  oral 
reading  performance  and  yields  three  scores.  The  scores  are  based  on  word 
accuracy,  error  rate,  and  self-correcting  behavior  while  reading  aloud.  The 
running  record  also  provides  information  on  the  types  of  strategies,  such  as 
cueing  systems,  that  the  student  is  using  to  derive  meaning  from  the  passage. 

At  least  two  running  records  must  be  entered  in  the  portfolio,  collected  between 
the  fourth  and  sixth  weeks  after  the  beginning  of  school  (or  as  soon  as  the 
student  enters)  and  at  the  end  of  the  school  year.  Passages  selected  for 
reading  by  the  student  are  from  a leveled  list  developed  by  the  district’s  portfolio 
validation  committee  and  are  unfamiliar  material  for  the  student  (see  Figure  4). 

A student  is  allowed  one  silent  reading  prior  to  the  administration  of  the  running 
record.  All  teachers  are  given  training  in  administering  running  records  and  use 
them  on  a routine  basis  as  a part  of  the  district's  language  arts  program. 
Students  who  begin  in  the  program  as  nonreaders  (NR)  cannot  be  assessed  on 
running  records  and  are  coded  as  such.  The  student’s  instructional  reading 
level (i.e., the level at which the student can read with 90% to 94% accuracy
with  at  least  70%  comprehension)  is  recorded  on  the  SOI  form. 
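
The running record scores and the instructional-level criterion described above can be illustrated with a short sketch. It assumes the conventional running record formulas (accuracy as the percentage of words read correctly, and error and self-correction rates expressed as "1 in N" ratios); the district's exact scoring rules are not given here, so the function names, the ratio definitions, and the treatment of accuracies between 94% and 95% are assumptions made for illustration.

```python
def running_record_scores(words_read, errors, self_corrections):
    """Return (accuracy %, error ratio, self-correction ratio).

    Assumed conventional formulas, not taken from the district's manual:
      accuracy         = (words - errors) / words * 100
      error ratio      = words / errors               (read as "1 in N")
      self-correction  = (errors + SC) / SC           (read as "1 in N")
    """
    if words_read <= 0:
        raise ValueError("words_read must be positive")
    accuracy = 100.0 * (words_read - errors) / words_read
    # None means the ratio is undefined because nothing was counted.
    error_ratio = words_read / errors if errors else None
    sc_ratio = (
        (errors + self_corrections) / self_corrections
        if self_corrections
        else None
    )
    return accuracy, error_ratio, sc_ratio


def is_instructional_level(accuracy, comprehension):
    """Criterion quoted in the text: 90% to 94% accuracy with at least
    70% comprehension. Accuracies from 94% up to (not including) 95%
    are counted as 94% here, which is an assumption."""
    return 90.0 <= accuracy < 95.0 and comprehension >= 70.0


# Hypothetical passage: 150 words, 12 errors, 4 self-corrections.
accuracy, error_ratio, sc_ratio = running_record_scores(150, 12, 4)
```

With these hypothetical counts the student reads with 92% accuracy, so the passage would sit at the instructional level provided comprehension is at least 70%.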
Running Record Book Selection List

Stage/Level | Title | Publisher | Pages to Read | Teachers Read Aloud | Retelling

Emergent Readers
1 | A Zoo | Rigby | Complete Text | Yes | No
2 | Faces | Wright Group | Complete Text | Yes | No
3 | In My Bed | Rigby | Complete Text | Yes | No
4 | The Scarecrow | Rigby | Complete Text | Yes | No
5 | Wake Up, Mom! | Wright Group | Complete Text | Yes | No
5 | I'm Bigger Than You | Wright Group | Complete Text | Yes | No
6 | Bread | Wright Group | Complete Text | Yes | No
6 | Where Are You Going, Aja Rose? | Wright Group | Complete Text | Yes | No
7 | Where is Nancy? | Rigby | Complete Text | Yes | No

Beginning Readers
9 | Come for A Swim | Wright Group | Complete Text | Yes | Yes
10 | Pete the Parakeet | Troll | p. 6-27 | Yes | Yes
12 | Stop that Rabbit | Troll | p. 10-28 | Yes | Yes
13 | Elephant in Trouble | Troll | Complete Text | Yes | Yes
15 | Mrs. Grindy's Shoes | Wright Group | p. 8 to end | Yes | Yes
16 | Letters to Mr. James | Wright Group | p. 10-16 | Yes | Yes
17 | The Difficult Day | Rigby | p. 20-24 | Yes | Yes
19 | My Sloppy Tiger Goes to School | Wright Group | p. 10-16 | Yes | Yes

Transitional
Early Trans. | Henry & Mudge in the Green Time | Cynthia Rylant, Macmillan Pub. Co. | p. 23-27 | No | Yes
Early Trans. | The Frog Prince | Edith Tarcov, Scholastic | p. 1-5, end at 2nd para. | No | Yes
Early Trans. | The Surprise Party | Annabelle Prager, Random House | p. 41-48 | No | Yes

Pasco County Schools, rev. April '95

Figure 4. Running record book selection list.


Running Record Book Selection List

Stage/Level | Title | Author | Pages to Read | Teachers Read Aloud | Retelling

Middle Transitional
Say Cheese | Patricia Reilly Giff, Dell Publishing | p. 58-59 | No | Yes
Nate the Great and the Sticky Case | Marjorie Sharmat, Dell Publishing | p. 7-9 | No | Yes
or Nate the Great and the Sticky Case | Marjorie Sharmat, Dell Publishing | p. 37-39 | No | Yes
Horrible Harry in Room 2B | Suzy Kline, Scholastic | p. 3-1 to page break | No | Yes
Barry, the Bravest St. Bernard | Lynn Hall, Random House | p. 5-7 | No | Yes

Late Transitional
Muggy Maggie | Beverly Cleary, Avon Books | p. 57-60 | No | Yes

Early Proficient
Bicycle Rider | M. Scioscia, Houghton Mifflin, Lit. Rdr 4 | p. 224 | No | Yes
or Bicycle Rider | M. Scioscia, Houghton Mifflin, Lit. Rdr 4 | p. 229, "Disturbed us!" to "You got one!" | No | Yes
Boxcar Children: Mountain Top Mystery | Gertrude Warner, Scholastic | p. 31-32 | No | Yes
Jennifer Murdley's Toad | Bruce Coville, Pocket Books Pub. | p. 15 | No | Yes

Middle Proficient
Little House on Rocky Ridge | Roger Lea MacBride, Harper Collins Pub. | p. 41-42 | No | Yes
Felita | Nicholasa Mohr, Houghton Mifflin, Lit. Rdr. 5 | p. 254 | No | Yes
or Felita | Nicholasa Mohr, Houghton Mifflin, Lit. Rdr. 5 | p. 259 | No | Yes
or Felita | Nicholasa Mohr, Houghton Mifflin, Lit. Rdr. 5 | p. 268-269 | No | Yes
The Fairly Intelligent Fly | James Thurber | | No | Yes
The Night the Bed Fell | James Thurber | | No | Yes

Proficient
The Gun Without a Bang | Robert Sheckley | | No | Yes

Maturing
(M1) Book 6 | | | No |
(M2) Book 7 | | | No | Questions
(M3) Book 8 | | | No | Questions

Other

Pasco County Schools, rev. April '96

Figure 4--continued.

The story retelling record. A vital part of the running record is the oral
story retelling record, in which the student is asked to tell the story in his/her own
words. Five areas are targeted and are based on the material read by the
student. The areas of focus include identifying characters, identifying the setting,
stating  the  problem,  sequencing  events,  and  understanding  the  resolution  or 
ending.  A story  retelling  checklist  and  a set  of  prompts  for  conducting  these 
assessments  are  provided  to  the  teacher.  As  in  the  running  record,  if  a student 
is  a nonreader,  this  assessment  cannot  be  made. 

Comprehension  checks.  At  the  primary  level,  this  entry  usually  consists 
of  the  story  retelling  checklist.  At  the  upper  elementary  grades,  evidence  of 
comprehension  is  gleaned  through  such  means  as  oral  retelling,  story  maps, 
written  summaries  and  reports,  opinion  and  proof  summaries,  and  reading 
response  logs  and  journals. 

The  reading  log.  This  entry  consists  of  a student's  log  of  books  or  stories 
completed  during  the  year.  The  number  of  books  read  is  an  index  of  reading 
interest.  It  provides  information  on  the  variety  of  materials  that  a student  is 
reading,  both  independent  and  assigned. 

Writing  portfolio  tasks 

In  the  area  of  writing,  a minimum  of  three  samples  of  work  produced  by 
the  student  as  part  of  classroom  instruction  are  included  in  the  portfolio.  There 
are  no  specifications  of  the  forms  or  genres  of  writing  to  be  included.  These 
may  include  journal  writing,  letters,  stories,  or  book  reports.  Dated  samples  are 
used  by  teachers  to  assess  the  proficiency  of  the  student.  Information  on  each 
piece is recorded on the Writing Summary Sheet, which is contained in the
portfolio (see Figure 5). Teachers check for evidence of each element and
include comments as appropriate. Lack of evidence in a particular area
provides support for the contention that a student is “introductory” in that area. If
a particular element is evident, then the level and quality of the element are taken
into consideration when rating in the “independent” and “advanced” categories
on the SOI form. Writing samples have been included in Appendix B.

Mathematics portfolio tasks

Completed  mathematics  exercises  and  unit  tests  provide  information 
regarding  performance  on  mathematics  tasks.  Teachers  review  the  exercises 
and  tests  for  processes  used  as  well  as  final  product.  The  district’s  mathematics 
series,  which  is  closely  aligned  with  the  NCTM  Standards,  provides  teachers 
with  a framework  in  which  certain  mathematical  topics  are  stressed.  This 
framework  is  then  used  to  guide  teachers  toward  particular  chapter 
assessments  or  exercises  that  pertain  to  each  of  the  items  on  the  SOI  form.  A 
judgment  of  proficiency  is  made  by  the  teacher  by  comparing  the  performance 
of  the  student  with  a set  of  benchmarks  which  include  characteristics  of 
independent  and  advanced  learners  in  the  area  of  mathematics  (see  Figure  6 
for  sample  benchmarks). 

WRITING SUMMARY SHEET

Student Name / Teacher / School

The sheet provides three identical blocks (Writing Sample #1, Writing Sample #2,
and a third sample), each with the following fields:

Date
Title or Topic
Type of Writing (Circle One): Narrative, Descriptive, Expository, Persuasive

Check if evident (with space for notes beside each item):
1. Exhibits quality in details/uses supporting ideas.
2. Organizes writing.
3. Stays focused on topic.
4. Uses standard conventions (gram/usage/mechanics).
5. Demonstrates a willingness to write.
6. Exhibits other writing qualities such as

Comments:

To indicate area(s) emphasized in instruction, place an * in front of the
appropriate number(s).

Figure 5. Writing summary sheet.

MATHEMATICS, INTERMEDIATE

PASCO OUTCOMES
* Solves mathematical problems by applying various problem solving strategies. (M1)
* Communicates using the language and symbols of mathematics across
disciplines and situations. (M2)
* Applies reasoning to justify solutions, thinking processes, and conjectures when
solving mathematical problems. (M3)
* Uses an estimation process to make everyday quantitative decisions. (M5)
* Develops a sense for what numbers mean and the relationships among them. (M7)
* Understands number operations and the relationships between and among those
operations. (M8)
* Uses computation appropriate to specific problems. (M9)
* Uses a variety of algorithms to add, subtract, multiply, and divide numbers. (M10)
* Develops a spatial sense of one's surroundings and the relationships among
the objects in them. (M11)
* Understands formal geometric principles. (M12)

INDEPENDENT STAGE BENCHMARK OUTCOMES
* Understands and constructs problems.
* Uses appropriate strategies and tools to collect and organize information.
* Draws appropriate conclusions.
* Work is accurate.
* Expresses information.
* Compares common fractions and decimal fractions by drawing diagrams and
using physical models.
* Adds and subtracts common fractions and decimal fractions.
* Explains and illustrates relationships among operations for whole numbers,
fractions, and mixed numbers.
* Knows and applies order of operations (x, /, +, -).
* Uses standard algorithms.
* Computes accurately.
* Extends simple mathematical models to the real world.
* Recognizes and constructs patterns with multiple attributes.
* Constructs descriptive algorithms of patterns with multiple attributes.

ADVANCED STAGE BENCHMARK OUTCOMES
* Understands and constructs nonroutine problems.
* Uses appropriate strategies and tools to collect, organize, and analyze information.
* Draws appropriate conclusions and relates findings to future events.
* Work is accurate.
* Perseveres through challenging works.
* Expresses and defends information through various forms.
* Expresses equivalent forms of whole numbers, common fractions, mixed numbers,
and decimal fractions.
* Uses fraction notation to express probabilities.
* Generates questions.
* Uses appropriate strategies to collect and organize information.
* Finds relationships and patterns.
* Recognizes that data can be distorted.
* Expresses information through various forms for presentation (e.g., multimedia,
graphs, charts, writing).
* Interprets, justifies, and translates results.

Figure 6. Sample mathematics benchmarks for the Student Outcomes Instrument.

Creating portfolio aggregate scores

Myriad options exist for aggregating individual item scores to create a total
score for each domain on the SOI. The procedure used in this study was
that which was adopted by the district in which the study took place. Individual
item scores (ranging from 0-3) were summed in order to form an aggregate
domain score. For example, in the area of mathematics, which consists of 8
items, scores ranged from 0-24. In establishing the cutoff in determining
individual student proficiency level, a minimum score, based upon the domain
score that would reflect a majority of “independent” or “advanced” stage ratings,
was established. A sample student profile for the mathematics domain is
presented in Table 6. In the area of mathematics, 5 or more items represent a
majority, so in this domain the cutoff was set at 13 (i.e., 2 points for each of the
five items on which a student received at least an “independent” rating and 1
point for each of the remaining three items).

Table 6

Sample Student Outcomes Instrument Mathematics Domain Profile

Item                                                  Rating         Points
Problem solving                                       Introductory   1
Whole number and fraction concepts and operations     Independent    2
Number estimation                                     Independent    2
Patterns                                              Independent    2
Measurement                                           Independent    2
Geometry                                              Introductory   1
Collects, organizes, and interprets data              Introductory   1
Math communication                                    Independent    2
Total Math Domain Score                                              13
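
The aggregation and cutoff rule described for the mathematics domain can be reproduced in a few lines. The sketch below is illustrative only: the point values for “Introductory” (1) and “Independent” (2) follow Table 6, while 3 points for “Advanced” and 0 points for “Not Taught” are inferred from the stated 0-3 item range, and the function names are hypothetical. Applying the same majority rule to domains other than mathematics is likewise an assumption.

```python
# Point values: 1 and 2 appear in Table 6; 0 and 3 are inferred from the
# stated 0-3 item score range (assumption).
POINTS = {"Not Taught": 0, "Introductory": 1, "Independent": 2, "Advanced": 3}


def domain_score(ratings):
    """Sum the item points to form the aggregate domain score."""
    return sum(POINTS[rating] for rating in ratings)


def proficiency_cutoff(n_items):
    """Minimum score reflecting a majority of items at 'Independent':
    2 points for each item in the smallest majority, 1 point for the rest."""
    majority = n_items // 2 + 1
    return 2 * majority + (n_items - majority)


def meets_cutoff(ratings):
    """True when the domain score reaches the cutoff for that domain."""
    return domain_score(ratings) >= proficiency_cutoff(len(ratings))


# The sample mathematics profile from Table 6 (8 items).
table_6_profile = [
    "Introductory",  # Problem solving
    "Independent",   # Whole number and fraction concepts and operations
    "Independent",   # Number estimation
    "Independent",   # Patterns
    "Independent",   # Measurement
    "Introductory",  # Geometry
    "Introductory",  # Collects, organizes, and interprets data
    "Independent",   # Math communication
]

print(domain_score(table_6_profile), proficiency_cutoff(8))  # prints: 13 13
```

For the eight mathematics items, the smallest majority is five, giving 2 x 5 + 1 x 3 = 13, which matches both the stated cutoff and the total in Table 6.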

Portfolio Validation Study

In  order  to  investigate  the  validity  of  using  portfolio  records  in  assessing 
student  achievement  within  the  school  district,  a committee,  including  members 
from  a nearby  university,  was  formed.  As  described  by  Hall  (1995),  a 3-year 
longitudinal  validation  of  language  arts  portfolios  for  the  elementary  grades  was 
conducted.  Pertinent  elements  of  the  study  included  the  investigation  of  the 
relationship of print concepts to reading levels at the primary level (Bright,
1995), the analysis of longitudinal running record data (Homan, 1995), and
ratings of writing samples (Banerji, 1995).

Bright’s  (1995)  investigation  of  the  validity  of  a measure  of  print  concepts 
as  a predictor  of  subsequent  reading  performance  among  primary  grade 
children (N = 411) yielded a correlation between the two measures of .56. This
relationship  formed  the  basis  of  the  contention  that  children  who  are  unable  to 
demonstrate  a knowledge  of  basic  print  concepts  are  at  risk  of  being 
unsuccessful  in  learning  to  read.  Homan’s  (1995)  analysis  of  running  record 
information  for  317  primary  students  highlighted  the  following  trends:  (a)  At  the 
end  of  first  grade,  18.2%  of  the  students  were  reading  at  an  Emergent  level;  by 
the  end  of  second  grade,  that  figure  had  decreased  to  4.4%;  (b)  at  the  end  of 
second  grade,  77%  of  the  students  were  reading  and  comprehending  at  least 
Transitional  material;  and  (c)  story  retelling  data  supported  the  expectation  that 
as  student  word  recognition  increases  and  higher  reading  levels  are  read,  the 
ability  to  comprehend  stays  intact.  Banerji  (1995)  analyzed  writing  data 
collected  from  284  third  graders  who  responded  to  one  of  four  student-selected 
prompts. Passages were rated on a 5-point holistic rating scale. Two years
later, 256 fifth graders repeated the procedure, with 130 cases of the original
sample  remaining  in  the  cohort.  Rater  agreement  between  pairs  of  raters 
ranged  from  46%  to  60%,  with  5%  to  10%  of  the  papers  discrepant  by  at  least  2 
points  on  the  rating  scale.  Correlation  coefficients  among  pairs  of  ratings  were 
found  to  be  moderate,  ranging  from  .40  to  .75  (p.  13).  Taken  together,  the 
results  of  these  studies  suggest  that  information  contained  in  the  student 
portfolio  is  a meaningful  data  source  in  assessing  student  achievement. 
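
To make the rater agreement figures concrete, the sketch below shows one common way to compute the proportion of papers on which two raters agree exactly and the proportion that are discrepant by at least 2 points. The ratings are invented for illustration and are not data from Banerji (1995); the function name is hypothetical.

```python
def rater_agreement(ratings_a, ratings_b, discrepancy=2):
    """Return (proportion of exact agreement, proportion of papers whose
    two ratings differ by at least `discrepancy` points)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs) / n
    discrepant = sum(abs(a - b) >= discrepancy for a, b in pairs) / n
    return exact, discrepant


# Invented holistic scores on a 5-point scale for ten papers.
rater_1 = [3, 4, 2, 5, 1, 3, 4, 2, 3, 4]
rater_2 = [3, 3, 2, 3, 1, 4, 4, 2, 5, 4]
exact, discrepant = rater_agreement(rater_1, rater_2)
```

In this invented example the raters agree exactly on 6 of the 10 papers and differ by 2 or more points on 2 of them.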

Instrumentation-Stanford Achievement Test
As  part  of  the  district-wide  assessment  program,  the  Stanford 
Achievement  Test,  Eighth  Edition  (SAT8),  Form  J,  is  administered  district  wide 
to  students  in  grades  2 through  9 each  spring.  All  students  take  the  Reading, 
Language,  and  Mathematics  subtests.  The  SAT8  is  a norm-referenced, 
standardized  achievement  test.  Spring,  1988,  norms  were  used  for  scoring. 

The  norm  group  for  the  SAT  consisted  of  310,000  K-12  students  in  the  spring 
and  fall  of  1988.  The  reliability  estimates,  reported  in  terms  of  KR-20,  for  grades 
2 and  above  range  from  .80  to  .97.  In  this  study,  scores  from  the  Reading 
Comprehension,  Mathematics  Concepts  and  Applications,  and  Language 
subtests  were  analyzed. 

Instrumentation— Florida  Writing  Assessment 
As  part  of  the  State  accountability  plan,  the  Florida  Writing  Assessment 
Program  requires  each  district  to  administer  the  assessment  district  wide  to 
students  in  grades  4,  8,  and  10  each  spring.  The  Writing  Assessment  is  a 
controlled process writing sample in which students are given 45 minutes to
respond to either a narrative or expository prompt at the fourth-grade level and
persuasive or expository prompts at grades 8 and 10.

Papers  are  scored  on  a holistic  6-point  scale  by  a group  of  trained  raters 
at  a central  location  in  the  state.  The  raters  consider  four  elements:  focus, 
organization,  support,  and  conventions.  In  1994,  the  Florida  Writing 
Assessment  was  administered  to  140,977  fourth  graders,  121,790  eighth 
graders,  and  102,057  tenth  graders  (State  of  Florida  Department  of  State,  1994, 
p.  2).  Prompts  were  reviewed  for  possible  biases  relating  to  gender,  religion, 
race,  or  ethnic  background. 

Method  of  Analysis 

In  order  to  investigate  questions  1 and  2,  generalizability  theory  was 
applied  to  estimate  the  internal  consistency  of  scored  features  of  the  student 
work  artifacts  in  the  SOI.  Generalizability  coefficients  were  estimated  for  each 
grade  level,  ethnic  group,  gender,  and  socioeconomic  status.  The  absolute  and 
relative  standard  errors  of  measurement  were  estimated  and  compared  across 
these  same  groups.  It  was  expected  that  the  generalizability  coefficients  and 
the  standard  error  would  remain  relatively  consistent  across  subgroups  and 
measurement  method. 

To  investigate  questions  3 and  4,  a cross-tabulation  of  the  proportion  of 
students  in  each  of  the  subgroups  consistently  meeting  objectives  was 
calculated  for  each  of  the  methods.  Cohen's  Kappa  was  estimated  for  each 
subject area and subgroup. It was expected that, while differences among the
percentages of students classified as proficient may exist between groups,
these differences would remain relatively consistent across measurement
method.

For  question  5,  a correlation  matrix  was  generated  for  the  SAT8,  the 
writing  assessment,  and  the  SOI.  Convergent  and  discriminant  validity 
coefficients  were  examined.  It  was  expected  that  performance  within  a given 
academic  area  (trait)  would  be  more  highly  correlated  than  performance  across 
academic  areas  measured  by  the  same  instrument  (method).  For  question  6,  a 
pooled  within-class  means  analysis  was  used  to  generate  the  correlation 
matrix.  The  effect  of  group  membership  is  commonly  referred  to  in  the  literature 
as the “frog pond” or “comparison” effect (Burstein, 1980, p. 201). This effect is
typically measured by the impact of the individual’s deviation from the mean of
the group ($X_{ij} - \mu_{j}$) on individual-level outcomes ($Y_{ij}$) (Burstein, 1980, p. 201).
In this study, i represented the individual and j the classroom.
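
A pooled within-class correlation of the kind described for question 6 can be sketched as follows. This is only an illustration under the assumption that each score is centered on its own classroom mean before the correlation is computed; the function name and the sample data are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from math import sqrt


def pooled_within_class_r(x, y, classroom):
    """Correlate x and y after removing each classroom's mean from both."""
    groups = defaultdict(list)
    for xi, yi, c in zip(x, y, classroom):
        groups[c].append((xi, yi))

    sxy = sxx = syy = 0.0
    for members in groups.values():
        mean_x = sum(m[0] for m in members) / len(members)
        mean_y = sum(m[1] for m in members) / len(members)
        for xi, yi in members:
            dx = xi - mean_x  # deviation of individual i from class j mean
            dy = yi - mean_y
            sxy += dx * dy
            sxx += dx * dx
            syy += dy * dy
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for two classrooms with very different class means.
x = [1, 2, 3, 10, 11, 12]
y = [2, 4, 6, 5, 7, 9]
classroom = ["A", "A", "A", "B", "B", "B"]
r_within = pooled_within_class_r(x, y, classroom)
```

Because the class means are removed first, differences between classrooms in overall level do not contribute to the coefficient.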

Summary 

Data  from  a sample  of  1,742  students  in  grades  3 to  5 enrolled  in  one 
Florida  district’s  Title  I program  were  analyzed.  Data  were  drawn  from  recent 
administrations  of  the  Stanford  Achievement  Test,  a norm-referenced, 
standardized  achievement  test;  the  Florida  Writing  Assessment,  a controlled 
process writing sample for fourth-grade students; and the portfolio-based
Student  Outcomes  Instrument  (SOI),  a summary  instrument  that  provides  ratings 
of  student  proficiency  in  the  areas  of  reading,  writing,  and  mathematics. 

Internal  consistency  coefficients  for  several  features  of  portfolio  artifacts 
and the absolute and relative standard errors of measurement were estimated
for each subgroup. A cross-tabulation of the percentage of students in each of
the  subgroups  meeting  the  proficiency  criteria  was  conducted  for  each  of  the 
measurement  methods.  Cohen’s  Kappa  was  calculated  as  an  index  of  the 
decision  consistency  across  measurement  methods.  A correlation  matrix  was 
generated  for  the  SAT,  the  writing  assessment,  and  the  SOI.  Convergent  and 
discriminant  validity  coefficients  were  examined,  and  Steiger’s  z*  was  estimated 
for  each  contrast  of  interest.  As  a follow-up  procedure,  exploratory  factor 
analysis  was  conducted. 


CHAPTER  4 

RESULTS  AND  DISCUSSION 

The  purpose  of  this  study  was  to  investigate  the  relationship  between 
student  achievement  outcomes  in  a district  Title  I program  as  determined  by 
three  measurement  methods.  The  specific  questions  to  be  addressed  were  as 
follows  for  each  grade  level  and  subject  area: 

1. What is the internal consistency of the teacher ratings of scored
features  of  student  artifacts  in  a standardized  portfolio  assessment? 

2.  Do  the  internal  consistency  coefficients  of  the  portfolio-based  ratings 
vary  significantly  across  grade  level,  ethnic  group,  gender  group,  and 
socioeconomic  group?  How  do  the  absolute  and  relative  standard  errors  of 
measurement  vary? 

3.  What  proportion  of  students  are  consistently  classified  as  proficient  or 
not  proficient,  based  on  each  possible  pair  of  three  assessment  methods? 

4.  Does  the  consistency  of  classification  on  different  pairs  of  assessment 
methods  vary  across  grade  level,  ethnic  group,  gender  group,  and 
socioeconomic  group? 

5.  What  is  the  pattern  of  convergent  and  discriminant  validity  coefficients 
between  the  MC-NRT,  a state-level  performance  assessment  (writing),  and  the 
portfolio-based  instrument  in  subject  areas  of  reading,  mathematics,  and 
language arts?

6. Controlling for class mean level of performance, what is the pattern of
convergent  and  discriminant  validity  coefficients? 

Of  principal  concern  in  this  study  was  the  consistency  of  student 
achievement  outcomes  across  measurement  methods.  Also  of  interest  was  the 
issue  of  the  variability  in  student  score  patterns  across  various  groups  (e.g., 
grade  level,  racial/ethnic,  gender,  and  socioeconomic  level).  Subgroups  with  at 
least  30  cases  were  included  in  the  analyses. 

Descriptive  Statistics 

Table  7 presents  the  number  of  students  included  in  the  analyses  by 
subgroup.  Tables  8-10  present  mean  scores  and  standard  deviations  for  the 
Stanford  Achievement  Test  (SAT),  8th  edition,  the  Student  Outcomes  Instrument 
(SOI),  and  the  Florida  Writing  Assessment  (grade  4 only)  for  all  racial  groups 
and  grade  levels.  Table  11  presents  mean  scores  and  standard  deviations  for 
the  SAT,  SOI,  and  Florida  Writing  Assessment  for  gender  groups.  Tables  12-14 
present  mean  scores  and  standard  deviations  for  the  SAT,  SOI,  and  Florida 
Writing  Assessment  for  all  socioeconomic  groups.  For  the  SAT,  normal  curve 
equivalent  (NCE)  scores  are  reported.  For  the  Student  Outcomes  Instrument, 
the  range  of  possible  scores  in  the  Reading  Domain  was  0-16,  in  the  Language 
Domain  0-12,  and  in  the  Math  Domain  0-24.  Reading  scores  ranged  from  an 
average  of  9.8  in  grade  3 to  12.2  in  grade  5.  Language  scores  ranged  from  an 
average of 6.5 in grade 3 to 7.5 in grade 5. Math scores ranged from an
average of 12.9 in grade 3 to 14.5 in grade 4. As with the Stanford
Achievement  Test,  whites  tended  to  score  higher  than  the  other  racial/ethnic 
groups, with the exception of the math domain at grades 4 and 5, where blacks

Table 7

Number Tested by Grade Level, Ethnic Group, Gender, and Socioeconomic Level

Grade  Grouping Variable    Subgroup       Reading   Math   Language
3      Total                                   536    533        505
       Racial/Ethnic        Whites             434    432        410
                            Blacks              47     46         40
                            Hispanics           55     55         55
       Gender               Females            242    240        225
                            Males              294    293        280
       Socioeconomic Level  Not Eligible       151    150        143
                            Free               353    351        331
                            Reduced             32     32         31
4      Total                                   559    556        486
       Racial/Ethnic        Whites             444    445        388
                            Blacks              55     55         51
                            Hispanics           60     56         47
       Gender               Females            280    276        254
                            Males              279    280        232
       Socioeconomic Level  Not Eligible       183    183        171
                            Free               331    328        277
                            Reduced             45     45         38
5      Total                                   647    648        644
       Racial/Ethnic        Whites             498    497        495
                            Blacks              66     67         67
                            Hispanics           83     84         82
       Gender               Females            319    321        317
                            Males              328    327        327
       Socioeconomic Level  Not Eligible       177    177        175
                            Free               398    400        397
                            Reduced             72     71         72

Table 8

Means (and Standard Deviations) for SOI and SAT for All Racial Groups,
Grade 3

SOI
Subgroup      Reading        Math           Language
Whites        10.1 (4.0)     13.1 (4.7)     6.6 (2.3)
Blacks         8.7 (3.8)     12.3 (4.4)     6.4 (2.2)
Hispanics      8.8 (3.6)     12.6 (3.5)     6.4 (1.9)
Total          9.8 (4.0)     12.9 (4.5)     6.5 (2.2)

SAT
Subgroup      Reading        Math           Language
Whites        44.8 (20.0)    48.7 (20.5)    43.9 (20.4)
Blacks        31.5 (16.2)    39.9 (19.8)    37.1 (15.6)
Hispanics     34.7 (16.2)    46.7 (19.3)    37.6 (14.8)
Total         42.6 (19.9)    47.7 (20.5)    42.7 (19.7)

Note. Subtests reported for the SAT are Reading Comprehension, Math
Application, and Total Language.

Table 9

Means (and Standard Deviations) for SOI and SAT for All Racial Groups,
Grade 4

SOI
Subgroup      Reading        Math           Language
Whites        11.5 (3.6)     14.5 (4.6)     7.4 (2.3)
Blacks        11.2 (3.3)     14.7 (4.7)     7.2 (2.3)
Hispanics      9.7 (4.2)     14.3 (4.7)     7.3 (2.2)
Total         11.3 (3.7)     14.5 (4.6)     7.4 (2.3)

SAT
Subgroup      Reading        Math           Language
Whites        42.6 (21.6)    50.2 (20.5)    46.4 (18.5)
Blacks        38.9 (16.5)    47.6 (19.1)    42.3 (16.9)
Hispanics     33.5 (19.1)    45.8 (18.5)    42.2 (15.1)
Total         41.2 (21.1)    49.5 (20.2)    45.5 (18.1)

Florida Writes
Subgroup      Score
Whites        2.1 (.9)
Blacks        2.0 (.8)
Hispanics     1.9 (.9)
Total         2.1 (.9)

Note. Subtests reported for the SAT are Reading Comprehension, Math
Application, and Total Language.

Table 10

Means (and Standard Deviations) for SOI and SAT for All Racial Groups,
Grade 5

SOI
Subgroup      Reading        Math           Language
Whites        12.6 (3.1)     14.3 (4.4)     7.5 (2.2)
Blacks        11.1 (2.8)     15.5 (4.4)     7.6 (2.2)
Hispanics     10.7 (3.6)     14.1 (4.5)     7.2 (2.3)
Total         12.2 (3.2)     14.4 (4.4)     7.5 (2.2)

SAT
Subgroup      Reading        Math           Language
Whites        42.9 (19.5)    48.8 (19.1)    43.8 (19.8)
Blacks        30.1 (16.9)    36.0 (17.4)    30.9 (17.9)
Hispanics     32.3 (15.8)    35.5 (18.0)    31.7 (14.6)
Total         40.2 (19.5)    45.7 (19.5)    40.9 (19.7)

Note. Subtests reported for the SAT are Reading Comprehension, Math
Application, and Total Language.

Table 11

Means (and Standard Deviations) for SOI and SAT for Males and Females,
Grades 3-5

SOI
Grade  Subgroup    Reading        Math           Language
3      Females     10.0 (3.9)     12.7 (4.2)     6.6 (2.2)
       Males        9.7 (4.1)     13.2 (4.8)     6.5 (2.2)
4      Females     11.7 (3.5)     14.9 (4.6)     7.6 (2.2)
       Males       10.9 (3.9)     14.1 (4.5)     7.0 (2.3)
5      Females     12.6 (2.7)     14.7 (4.4)     8.0 (2.1)
       Males       11.8 (3.5)     14.1 (4.4)     7.0 (2.3)

SAT
Grade  Subgroup    Reading        Math           Language
3      Females     43.5 (19.9)    47.0 (19.9)    44.1 (19.7)
       Males       41.8 (19.9)    48.3 (20.9)    41.5 (19.6)
4      Females     44.2 (21.4)    50.1 (20.0)    48.5 (17.9)
       Males       38.3 (20.4)    48.9 (20.4)    42.3 (17.8)
5      Females     42.9 (18.1)    46.2 (19.6)    44.0 (18.7)
       Males       37.7 (20.3)    45.3 (19.5)    38.0 (20.3)

Florida Writes
Grade  Subgroup    Score
4      Females     2.2 (.8)
       Males       1.9 (.9)

Note. Subtests reported for the SAT are Reading Comprehension, Math
Application, and Total Language.

Table 12

Means (and Standard Deviations) for SOI and SAT for All Socioeconomic
Groups, Grade 3

SOI
Subgroup        Reading        Math           Language
Not Eligible    11.1 (3.9)     13.9 (4.7)     7.0 (2.4)
Free             9.3 (3.9)     12.5 (4.4)     6.4 (2.1)
Reduced          9.5 (3.9)     13.3 (4.8)     6.2 (2.3)

SAT
Subgroup        Reading        Math           Language
Not Eligible    50.3 (20.4)    53.6 (19.8)    48.8 (21.2)
Free            39.4 (19.1)    45.2 (20.3)    39.9 (18.6)
Reduced         41.4 (15.6)    47.9 (19.8)    44.0 (17.4)

Note. Subtests reported for the SAT are Reading Comprehension, Math
Application, and Total Language.

Table 13

Means (and Standard Deviations) for SOI and SAT for All Socioeconomic
Groups, Grade 4

SOI
Subgroup        Reading        Math           Language
Not Eligible    12.1 (3.6)     15.2 (5.0)     7.8 (2.4)
Free            10.9 (3.7)     14.1 (4.4)     7.1 (2.2)
Reduced         11.5 (3.7)     14.3 (3.8)     7.2 (1.5)

SAT
Subgroup        Reading        Math           Language
Not Eligible    47.2 (22.7)    53.7 (21.3)    49.4 (19.9)
Free            37.6 (19.5)    46.9 (18.8)    43.0 (16.8)
Reduced         43.9 (20.3)    51.6 (22.8)    47.2 (16.4)

Florida Writes
Subgroup        Score
Not Eligible    2.2 (.9)
Free            2.0 (.8)
Reduced         2.1 (.9)

Note. Subtests reported for the SAT are Reading Comprehension, Math
Application, and Total Language.

Table 14

Means (and Standard Deviations) for SOI and SAT for All Socioeconomic
Groups, Grade 5

SOI
Subgroup        Reading        Math           Language
Not Eligible    13.1 (2.7)     14.8 (4.2)     7.9 (2.2)
Free            11.8 (3.3)     14.1 (4.5)     7.3 (2.2)
Reduced         12.5 (3.0)     14.8 (4.6)     7.7 (2.1)

SAT
Subgroup        Reading        Math           Language
Not Eligible    46.2 (18.0)    52.5 (18.2)    46.6 (19.6)
Free            37.2 (19.6)    42.3 (19.4)    37.7 (19.0)
Reduced         42.1 (18.4)    48.0 (19.5)    45.0 (20.3)

Note. Subtests reported for the SAT are Reading Comprehension, Math
Application, and Total Language.

performed better than whites on average. Females tended to score higher than
males,  with  the  exception  of  the  math  domain  at  grade  3.  In  terms  of 
socioeconomic  level,  the  results  for  the  SOI  closely  parallel  those  for  the 
Stanford;  students  not  eligible  for  free  or  reduced-priced  lunches  tended  to 
outperform  their  more  underprivileged  peers. 

At  grade  3,  NCE  scores  ranged  from  42.6  in  Reading  Comprehension  to 
47.7 in Math Application. At grade 4, scores ranged from 41.2 in Reading
Comprehension  to  49.5  in  Math  Application.  At  grade  5,  scores  ranged  from 
40.2  in  Reading  Comprehension  to  45.7  in  Math  Application.  On  average, 
whites  tended  to  score  higher  than  students  in  other  racial/ethnic  categories, 
and  females  tended  to  score  higher  than  males  (an  exception  to  this  was  in  the 
math  area  at  grade  3).  In  terms  of  socioeconomic  level,  students  who  were  not 
eligible  for  the  free  or  reduced-priced  lunch  program  tended  to  score  higher 
than  students  who  were  eligible  for  the  program.  On  average,  students  eligible 
for  the  free  lunch  program  scored  lower  than  those  eligible  only  for  the  reduced- 
priced  lunch  program. 

For  the  Florida  Writing  Assessment,  fourth-grade  students  were  randomly 
assigned one of two types of prompts, expository or narrative. The papers were
scored on a 6-point Likert scale. The mean score on the expository prompt of
the Florida Writing Assessment was 1.7. The mean score on the narrative
prompt was 2.4. Across both prompts, the weighted mean score was 2.1. The
trend of scores across subgroups closely paralleled those for the Stanford and
the SOI; whites tended to score higher than other ethnic/racial groups, females
tended to score higher than males, and students not participating in the free or
reduced-priced lunch program tended to score higher than their
underprivileged  peers. 

Effect  Size 

In  a one-way  analysis  of  variance  design,  the  effect  size  may  be 
estimated  by  the  eta-squared  statistic,  which  may  be  used  to  describe  the 
proportion  of  total  variability  explained  by  the  grouping  variable.  As  the  value  of 
eta-squared  approaches  1,  it  is  an  indication  that  the  total  variability  is 
attributable  to  differences  between  the  groups,  while  a value  close  to  0 
indicates  that  the  grouping  variable  explains  little  of  the  total  variability  (Norusis, 
1994, p. 41). Eta-squared is computed by the formula

$$\eta^2 = 1 - \frac{\sigma^2_{\text{residual}}}{\sigma^2_Y}$$

where $\sigma^2_Y$ is the total variance in Y (Keppel & Zedeck, 1989, pp. 51-54).
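
The computation can be sketched in a few lines. This is a minimal illustration, assuming the residual and total variances are both taken over the same N so that their ratio equals the ratio of the within-group and total sums of squares; the function name and sample scores are hypothetical.

```python
def eta_squared(groups):
    """Proportion of total variability explained by group membership."""
    scores = [s for g in groups for s in g]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((s - grand_mean) ** 2 for s in scores)

    # Residual (within-group) sum of squares around each group's own mean.
    ss_residual = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_residual += sum((s - group_mean) ** 2 for s in g)

    return 1.0 - ss_residual / ss_total


# Hypothetical scores for two subgroups.
example = eta_squared([[1, 2, 3], [4, 5, 6]])
```

When the groups have identical score distributions, the residual and total sums of squares coincide and the function returns 0.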

Values  of  eta-squared  were  estimated  and  are  reported  in  Tables  15-17. 
Values  ranged  from  .0001  for  gender  groups  for  SOI  language  at  grade  3 to 
.0796 for SAT math at grade 5 for ethnic groups. Cohen (1977) provided a
means of expressing eta-squared as a function of the effect size index f,
$\eta^2 = f^2 / (1 + f^2)$, and defined specific values of f for "small,"
"medium," and "large" effects. According to Cohen, a small effect is defined
by f = .10 (or $\eta^2$ = .0099). A medium effect is defined by f = .25
(or $\eta^2$ = .0588). A large effect is defined by f = .40 (or $\eta^2$ =
.1379) (pp. 283-287). Using Cohen's guidelines, the majority of the

Table 15


Eta-squared Values for SOI, SAT, and Florida Writing Assessment for
Ethnic Groups

                  Reading    Math     Language
Grade 3   SOI     .0170      .0032    .0012
          SAT     .0535      .0146    .0170
Grade 4   SOI     .0241      .0004    .0024
          SAT     .0187      .0052    .0085
          FLWR                        .0035
Grade 5   SOI     .0499      .0080    .0022
          SAT     .0642      .0796    .0714

Table 16


Eta-squared Values for SOI, SAT, and Florida Writing Assessment for
Gender Groups

                  Reading    Math     Language
Grade 3   SOI     .0011      .0032    .0001
          SAT     .0018      .0009    .0042
Grade 4   SOI     .0130      .0062    .0273
          SAT     .0194      .0010    .0295
          FLWR                        .0243
Grade 5   SOI     .0151      .0041    .0421
          SAT     .0178      .0005    .0230


Table  17 

Eta-squared Values for SOI, SAT, and Florida Writing Assessment for
Socioeconomic Groups

                  Reading    Math     Language
Grade 3   SOI     .0433      .0184    .0140
          SAT     .0594      .0335    .0411
Grade 4   SOI     .0243      .0118    .0291
          SAT     .0457      .0250    .0279
          FLWR                        .0165
Grade 5   SOI     .0345      .0057    .0128
          SAT     .0411      .0532    .0444


eta-squared  values  reported  in  Tables  15-17  would  be  considered  small  or 
medium.  Specifically,  medium  effect  sizes  were  found  for  racial-ethnic  groups 
(Table  15)  at  grade  5 for  all  three  SAT  subtests,  and  for  socioeconomic  groups 
(Table 17) at grade 3 for the SAT reading subtest. Most of the remaining
eta-squared values for the SAT would be considered "small." While the SOI and
Florida  Writing  Assessment  did  not  produce  any  medium  or  large  effect  sizes, 
many  small  effect  sizes  were  evident,  most  consistently  for  socioeconomic 
groups  (Table  17).  Interestingly,  the  vast  majority  of  the  reported  eta-squared 
values  for  the  reading  domain  on  the  SOI  and  SAT  would  be  considered  small 
to  medium  according  to  Cohen's  guidelines. 
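
The conversion from Cohen's f to eta-squared, and the resulting labels, can be checked with a short sketch. It assumes the standard relation eta-squared = f^2 / (1 + f^2), which reproduces the three threshold values quoted above; the function names are hypothetical.

```python
def eta_squared_from_f(f):
    """Convert Cohen's effect size index f to eta-squared."""
    return f ** 2 / (1 + f ** 2)


def cohen_label(eta_sq):
    """Label an eta-squared value using Cohen's f thresholds of .10, .25, .40."""
    small, medium, large = (eta_squared_from_f(f) for f in (0.10, 0.25, 0.40))
    if eta_sq >= large:
        return "large"
    if eta_sq >= medium:
        return "medium"
    if eta_sq >= small:
        return "small"
    return "below small"


# Values taken from Tables 15-17.
labels = {
    "SAT math, grade 5, ethnic groups": cohen_label(0.0796),
    "SAT reading, grade 3, socioeconomic groups": cohen_label(0.0594),
    "SOI language, grade 3, gender groups": cohen_label(0.0001),
}
```

Under these thresholds the two SAT values fall in the medium band, and the smallest reported value falls below the small threshold.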

Internal  Consistency  and  Standard  Error  of  Measurement 
Tables  18  through  20  present  internal  consistency  coefficients  for 
estimating  the  generalizability  of  ratings  over  the  various  scored  features  of  the 
student  artifacts  in  the  portfolio  and  standard  errors  for  relative  and  absolute 
decisions  for  the  SOI,  in  reading,  math,  and  language,  respectively.  The 
generalizability  coefficients  for  the  SOI  were  estimated  using  a persons-by- 
items  analysis  of  variance  as  described  by  Crocker  and  Algina  (1986),  using 
features as the facet analogous to items. Values of the generalizability
coefficient were generally high across grade levels, subgroups, and subject
areas, with no significant trends of inconsistencies among the various
subgroups evident. Values ranged from .84 for grade 4 students eligible for
reduced-priced lunch in the language area and grade 5 black students in the

Table 18


G-Coefficients and Standard Errors for Relative and Absolute Decisions for SOI,
Reading

Grade  Group            G-Coeff    RSE            ASE
3      Race             .89-.93    .17-.20        .18-.21
       Gender           .91-.94    [illegible]    .17-.20
       SES              .91-.92    .17-.19        .18-.21
       Total Grade 3    .93        .17            .18
4      Race             .90-.94    .16-.18        [illegible]
       Gender           .93        .16-.17        .17-.18
       SES              .91-.95    .15-.18        .16-.19
       Total Grade 4    .93        .17            .18
5      Race             .84-.92    .16-.22        .17-.22
       Gender           .91        .17-.18        .17-.19
       SES              .90-.93    .15-.19        .16-.19
       Total Grade 5    .91        .18            .18

Table 19


G-Coefficients and Standard Errors for Relative and Absolute Decisions for SOI,
Math

Grade  Group            G-Coeff    RSE            ASE
3      Race             .93-.97    .11-.12        .11-.12
       Gender           .95-.97    .11            .11
       SES              .96-.97    .11            .11
       Total Grade 3    .96        .11            .11
4      Race             .96-.97    .10-.11        .10-.11
       Gender           .96        .11            .11
       SES              .96-.97    .09-.11        .09-.11
       Total Grade 4    .96        .11            .11
5      Race             .95-.96    .11-.12        .11-.13
       Gender           .96        .11            .11
       SES              .95-.97    [illegible]    .10-.12
       Total Grade 5    .96        .11            .11

Table 20


G-Coefficients and Standard Errors for Relative and Absolute Decisions for SOI,
Language

Grade  Group            G-Coeff    RSE            ASE
3      Race             .90-.94    [illegible]    .13-.16
       Gender           .93        .14-.15        .15-.16
       SES              .92-.95    .12-.15        [illegible]
       Total Grade 3    .93        .15            .16
4      Race             .92-.96    .12-.16        .13-.17
       Gender           .92-.93    .14-.17        .15-.18
       SES              .84-.93    .15-.16        .16-.17
       Total Grade 4    .92        .16            .17
5      Race             .93-.94    .13-.15        .14-.16
       Gender           .93-.94    .13-.14        .14-.15
       SES              .93-.95    .12-.15        .13-.15
       Total Grade 5    .94        .14            .15


reading  area  to  .97.  In  general,  values  tended  to  be  highest  in  the  mathematics 
area. 

The calculation of the standard error of measurement depends on whether
absolute or relative decisions regarding examinees are to be made. In the case
of an absolute decision for the design used in this study, the standard error of
measurement is based on the formula

$$\sigma_{\text{abs}} = \sqrt{\frac{\sigma^2_i + \sigma^2_e}{n_i}}$$

In the case of a relative decision for the design used in this study, the standard
error of measurement is based on the formula

$$\sigma_{\text{rel}} = \sqrt{\frac{\sigma^2_e}{n_i}}$$

For the above expressions, $n_i$ is the number of items on the test, and
$\sigma^2_i$ and $\sigma^2_e$ were estimated using a two-factor ANOVA
(Crocker & Algina, 1986, p. 176). As
would  be  expected,  the  values  for  the  relative  standard  error  of  measurement 
were  smaller  than  those  for  the  absolute  standard  error  of  measurement. 
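
The persons-by-items analysis can be sketched as follows. This is an illustrative reconstruction rather than the procedure actually run in the study: it assumes a fully crossed design with one score per person per scored feature, estimates the variance components from the usual two-way ANOVA mean squares, and sets negative estimates to zero. The function name and sample data are hypothetical.

```python
from math import sqrt


def g_study(scores):
    """scores: one row per person, one column per scored feature (item)."""
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for persons, items, and the residual.
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_e = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_e = ss_e / ((n_p - 1) * (n_i - 1))

    # Variance component estimates; negative estimates are set to zero.
    var_e = max(ms_e, 0.0)
    var_p = max((ms_p - ms_e) / n_i, 0.0)
    var_i = max((ms_i - ms_e) / n_p, 0.0)

    relative_error = var_e / n_i
    return {
        "g_coefficient": var_p / (var_p + relative_error),
        "relative_sem": sqrt(relative_error),
        "absolute_sem": sqrt((var_i + var_e) / n_i),
    }


# Hypothetical ratings: four students scored on three features.
result = g_study([[2, 3, 3], [1, 1, 2], [3, 3, 2], [0, 1, 1]])
```

Because the absolute error term adds the item variance component to the residual, the absolute standard error can never be smaller than the relative one.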

Values  of  the  relative  standard  error  of  measurement  varied  from  .09  for  grade  4 
students  eligible  for  reduced-price  lunch  in  the  math  area  to  .22  for  grade  5 
Hispanic  and  black  students.  Across  grade  levels  and  subgroups,  values 
tended  to  be  lowest  in  the  math  area,  which  had  the  greatest  number  of  items, 
and highest in the reading area. Similarly, values of the absolute standard
error of measurement tended to be lowest in the math area and highest in the
reading area, which had the fewest items. As with the
generalizability coefficient, no significant trends of inconsistencies among the
various subgroups were evident for the standard errors of measurement.

Differences  in  Percentages  of  Proficiency  Classification 
Of  principal  concern  in  this  study  was  the  consistency  of  student 
performance  across  measurement  methods.  Also  of  interest  was  the  issue  of 
the  relationship  among  assessments  for  various  groups  (e.g.,  grade  level, 
racial/ethnic,  gender,  and  socioeconomic  level).  In  order  to  examine  these 
issues,  the  proportion  of  students  classified  and  consistently  classified  as 
proficient  on  each  of  the  measurement  methods  was  examined. 

Tables 21-23 present the percentage of students classified as proficient on
the SOI, SAT, and Florida Writing Assessment; the percentage of students
consistently proficient across measures; and Cohen’s Kappa.

Proficiency  levels  for  the  SOI  were  established  by  the  district  as 
follows: 

The  Reading  Domain  consisted  of  instructional  reading  level  and 
three  reading  items.  Students  needed  to  be  proficient  in  two  of  the 
three  items  and  at  the  third-grade  level  be  instructional  at  the 
middle  proficient  level  or  higher.  At  fourth  grade,  students  needed 
to  be  reading  at  least  at  the  proficient  level,  and  at  fifth  grade, 
students  needed  to  be  reading  at  least  at  the  maturing  level  (e.g., 
at grade 3 scores of 10 or higher were proficient, at grade 4 scores
of 11 or higher were proficient, and at grade 5 scores of 12 or
higher  were  proficient). 

The  Language  Domain  consisted  of  four  items.  To  be  considered 
proficient  in  writing,  students  needed  to  score  at  least  proficient  on  a 
majority  (e.g.,  three  of  the  four)  of  the  items.  Scores  of  7 or  higher  were 
proficient.

Table 21

Coefficient Kappa and Percentage Classified as Proficient in Reading by SOI,
SAT, and Both

Group           P-SOI    P-SAT    n      P-Both    Kappa

Grade 3
Total           57.3     35.7     504    33.1      .49
Whites          60.6     40.6     409    37.7      .50
Blacks          45.0     17.5      40    15.0      .30
Hispanics       41.8     12.7      55    12.7      .34
Females         58.2     37.8     225    35.1      .50
Males           56.6     34.1     279    31.5      .47
Not Eligible    68.5     50.3     143    49.0      .58
Free            53.6     30.0     330    27.6      .45
Reduced         45.2     29.0      31    19.4      .26

Grade 4
Total           71.0     37.2     489    36.0      .35
Whites          71.6     39.4     391    38.6      .38
Blacks          70.6     27.5      51    23.5      .14
Hispanics       66.0     29.8      47    27.7      .28
Females         73.4     40.2     256    39.1      .35
Males           68.2     33.9     233    32.6      .34
Not Eligible    74.0     47.4     173    46.2      .44
Free            69.0     29.2     277    28.2      .28
Reduced         71.8     48.7      39    46.2      .44

Grade 5
Total           70.6     31.2     640    29.1      .24
Whites          76.7     37.1     493    34.7      .22
Blacks          50.0     12.1      66    10.6      .18
Hispanics       50.6     11.1      81     9.9      .17
Females         75.6     34.8     316    32.9      .23
Males           65.7     27.8     324    25.3      .25
Not Eligible    82.3     42.9     175    42.3      .26
Free            63.5     25.1     394    22.3      .23
Reduced         81.7     36.6      71    33.8      .13

Note. P-SOI is the percentage of students classified as proficient on the SOI,
P-SAT is the percentage of students classified as proficient on the SAT, P-Both
is the percentage of students consistently classified as proficient on both
assessments, and Kappa is Cohen’s Kappa.


Table 22

Coefficient Kappa and Percentage Classified as Proficient in Mathematics by
SOI, SAT, and Both

Group           P-SOI    P-SAT    n      P-Both    Kappa

Grade 3
Total           54.0     47.6     504    36.1      .41
Whites          54.0     48.9     409    36.9      .42
Blacks          50.0     32.5      40    25.0      .35
Hispanics       56.4     49.1      55    38.2      .42
Females         52.0     47.1     225    35.1      .42
Males           55.6     48.0     279    36.9      .41
Not Eligible    61.5     54.5     143    40.6      .29
Free            50.6     45.2     330    34.5      .47
Reduced         54.8     41.9      31    32.3      .36

Grade 4
Total           71.8     48.7     489    43.4      .33
Whites          71.9     49.1     391    43.0      .30
Blacks          72.5     47.1      51    43.1      .35
Hispanics       70.2     46.8      47    46.8      .54
Females         74.2     51.6     256    44.5      .25
Males           69.1     45.5     233    42.1      .41
Not Eligible    72.8     56.1     173    48.6      .33
Free            70.8     43.7     277    39.0      .31
Reduced         74.4     51.3      39    51.3      .53

Grade 5
Total           69.5     39.7     640    34.4      .25
Whites          69.6     45.4     493    38.9      .28
Blacks          77.3     21.2      66    21.2      .15
Hispanics       63.0     19.8      81    17.3      .17
Females         70.9     38.9     316    34.5      .25
Males           68.2     40.4     324    34.3      .25
Not Eligible    74.9     57.1     175    49.1      .27
Free            66.0     31.0     394    26.6      .22
Reduced         76.1     45.1      71    40.8      .25

Note. P-SOI is the percentage of students classified as proficient on the SOI,
P-SAT is the percentage of students classified as proficient on the SAT, P-Both
is the percentage of students consistently classified as proficient on both
assessments, and Kappa is Cohen’s Kappa.

Table 23

Coefficient  Kappa  and  Percentage  Classified  as  Proficient  in  Language  by  SOI. 
SAT,  and  Florida  Writes 


Group 

P-SOI 

P-SAT 

n 

P-Both 

Kapp 

Grade  3 

Total 

52.0 

33.9 

504 

26.2 

.34 

Whites 

51.1 

37.2 

409 

28.4 

.37 

Blacks 

55.0 

27.5 

40 

25.0 

.38 

Hispanics 

56.4 

14.5 

55 

10.9 

.10 

Females 

53.8 

36.4 

225 

29.8 

.40 

Males 

50.5 

31.9 

279 

23.3 

.29 

Not  Eligible 

55.9 

47.6 

143 

33.6 

.28 

Free 

50.6 

28.2 

330 

23.3 

.36 

Reduced 

48.4 

32.3 

31 

22.6 

.28 

Grade  4 

Total 

67.5 

37.8 

489 

33.5 

.30 

Whites 

66.8 

39.9 

391 

35.0 

.31 

Blacks 

66.7 

29.4 

51 

27.5 

.28 

Hispanics 

74.5 

29.8 

47 

27.7 

.18 

Females 

74.2 

43.4 

256 

39.5 

.27 

Males 

60.1 

31.8 

233 

27.0 

.30 

Not  Eligible 

73.4 

48.0 

173 

44.5 

.36 

Free 

63.5 

30.7 

277 

26.4 

.25 

Reduced 

69.2 

43.6 

39 

35.9 

.22 
Table 23--continued.

Group           P-SOI   P-SAT   n       P-Both  Kappa

Grade 5
Total           73.8    30.2    640     28.4    .21
Whites          74.8    35.7    493     33.5    .24
Blacks          74.2    13.6    66      13.6    .10
Hispanics       66.7    9.9     81      9.9     .10
Females         83.9    35.8    316     34.5    .15
Males           63.9    24.7    324     22.5    .24
Not Eligible    81.1    41.1    175     39.4    .22
Free            69.5    23.9    394     22.1    .18
Reduced         78.9    38.0    71      36.6    .23

Group           P-SOI   P-FLWR  n       P-Both  Kappa

SOI and Florida Writes
Total           67.5    22.5    489     21.1    .20
Whites          66.8    23.8    391     22.3    .22
Blacks          66.7    15.7    51      13.7    .11
Hispanics       74.5    19.1    47      19.1    .15
Females         74.2    26.2    256     25.0    .18
Males           60.1    18.5    233     16.7    .20
Not Eligible    73.4    28.3    173     26.6    .19
Free            63.5    19.5    277     18.1    .19
Reduced         69.2    17.9    39      17.9    .18

Group           P-SAT   P-FLWR  n       P-Both  Kappa

SAT and Florida Writes
Total           37.8    22.5    489     15.1    .31
Whites          39.9    23.8    391     16.4    .31
Blacks          29.4    15.7    51      7.8     .18
Hispanics       29.8    19.1    47      12.8    .38
Females         43.4    26.2    256     17.6    .27
Males           31.8    18.5    233     12.4    .34
Not Eligible    48.0    28.3    173     23.1    .39
Free            30.7    19.5    277     10.5    .23
Reduced         43.6    17.9    39      12.8    .22

Note. P-SOI is the percentage of students classified as proficient on the SOI,
P-SAT is the percentage of students classified as proficient on the SAT,
P-FLWR is the percentage of students classified as proficient on Florida Writes,
P-Both is the percentage of students consistently classified as proficient on both
assessments in that panel, and Kappa is Cohen's Kappa.
The Math Domain consisted of 8 items. In order to be considered
proficient in math, students needed to score at least proficient on a
majority (e.g., 5 of the 8) of the items. Scores of 13 or higher were
proficient.

The extent to which the two assessment methods consistently classified
students as proficient is reported in Tables 21-23 in the column labeled
"P-Both." A commonly used, more interpretable measure of decision
consistency is Cohen's Kappa,

$$\kappa = \frac{P - P_c}{1 - P_c},$$

where $P$ is the observed proportion of consistent decisions and $P_c$ is the
chance probability of a consistent decision, calculated by using the formula

$$P_c = P_{1.}P_{.1} + P_{0.}P_{.0},$$

where $P_{1.}$ represents the probability of a mastery classification on one
measurement, and $P_{.1}$ represents the probability of a mastery classification on
the second measurement. Similarly, $P_{0.}$ and $P_{.0}$ represent the probability of a
nonmastery classification on each of the two measurement methods. Chance
consistency may be viewed as a baseline with which to judge the actual
consistency on the two measurement methods. Thus $\kappa$ "may be interpreted as
the increase in decision consistency that the tests provide over chance
expressed as a proportion of the maximum possible increase over chance
consistency" (Crocker & Algina, 1986, p. 201).
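
As a rough illustration of the Kappa definition above, the sketch below recovers Kappa from the three percentages reported in each row of Table 23 (percent proficient on each measure and percent proficient on both). The function name and argument names are hypothetical, and the calculation assumes a simple two-by-two mastery/nonmastery table; it is not the author's original computation.

```python
def cohen_kappa(p_first, p_second, p_both):
    """Cohen's kappa for two mastery/nonmastery classifications.

    p_first, p_second: proportion classified proficient on each measure.
    p_both: proportion classified proficient on both measures.
    (Hypothetical helper; assumes a 2 x 2 decision table.)
    """
    # Proportion classified as nonmasters by both measures.
    p_neither = 1.0 - p_first - p_second + p_both
    # Observed proportion of consistent decisions (both agree).
    p_observed = p_both + p_neither
    # Chance consistency: product of marginals for mastery and nonmastery.
    p_chance = p_first * p_second + (1.0 - p_first) * (1.0 - p_second)
    return (p_observed - p_chance) / (1.0 - p_chance)


# Grade 5 Blacks row of Table 23: P-SOI = 74.2, P-SAT = 13.6, P-Both = 13.6
print(round(cohen_kappa(0.742, 0.136, 0.136), 2))
```

Applied to the rows of Table 23, this calculation should land within rounding of the tabled Kappa values, because the tabled percentages are themselves rounded to one decimal place.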
Across the three assessment methods, the greatest proportion of
students was classified as proficient on the SOI, with 57.3% of grade 3
students, 71.0% of grade 4 students, and 70.6% of grade 5 students classified as
proficient in the area of reading. In the mathematics area, 54.0% of grade 3
students, 71.8% of grade 4 students, and 69.5% of grade 5 students were classified
as proficient. In the area of language, 52.0% of grade 3 students, 67.5% of
grade 4 students, and 73.8% of grade 5 students were classified as proficient. The
percentages of students classified as proficient in the corresponding subject
areas on the Stanford were considerably lower, with discrepancies in the 30
percentage point range occurring in grades 4 and 5 reading, grade 5 math, and
grades 4 and 5 language. The discrepancies among measurement methods
become even greater when performance on the Florida Writing Assessment is
considered. In Table 23, the percentage of grade 4 students rated as proficient
is 67.5 on the SOI, 37.8 on the SAT, and 22.5 on the Florida Writing
Assessment. Consequently, the proportion of students classified consistently as
proficient across the measurement methods is lowered by the differential
proficiency classification rates of each of the assessments. An examination of
the Kappa values indicates that the relatively low values may occur because of
the different points on the ability distribution where the standards have been set
on the instruments; another possible contributing factor could be that the two
types of measures are actually measuring two different traits. As shown in
Tables 21-23, the discrepancies among the measurement methods generally
occurred consistently across all grade, racial/ethnic, gender, and
socioeconomic groups.
Patterns of Convergent and Discriminant Validity Coefficients

The intercorrelations of the SAT, SOI, and Florida Writing Assessment for
each grade level, racial/ethnic group, gender, and socioeconomic group are
presented in Tables 24-33. Observed unadjusted correlations are reported
below the diagonal of Tables 24-33, while correlations based upon
scores adjusted for class means are found above the diagonal.
Performance on the SOI was positively related to performance on the SAT and
the Florida Writing Assessment. For each grade level as a whole, convergent
validity coefficients tended to be moderately high, with the strongest relationship
noted in the reading domain at grade 3.
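
The distinction between unadjusted correlations and correlations based on scores adjusted for class means can be sketched as follows. The source does not spell out the adjustment, so this example assumes it means subtracting each student's class mean from his or her score before correlating; the function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Plain Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def center_by_class(scores, classes):
    """Subtract each class's mean from its members' scores (assumed adjustment)."""
    groups = defaultdict(list)
    for s, c in zip(scores, classes):
        groups[c].append(s)
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    return [s - means[c] for s, c in zip(scores, classes)]


# Toy data: two classrooms with very different mean levels (illustrative only).
classes = ["A", "A", "A", "B", "B", "B"]
x = [1, 2, 3, 11, 12, 13]
y = [3, 2, 1, 13, 12, 11]

raw = pearson(x, y)  # dominated by between-class differences
adjusted = pearson(center_by_class(x, classes), center_by_class(y, classes))
print(round(raw, 2), round(adjusted, 2))
```

In the toy data the two correlations differ sharply because the classroom means drive the raw relationship; when the adjusted and unadjusted coefficients are close, classroom differences are contributing little to the relationship.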

Of  principal  concern  in  this  study  was  the  pattern  of  the  relationship 
among  the  measurement  methods,  examined  for  different  grade  level,  ethnic, 
gender,  and  socioeconomic  groups.  For  the  sample  at  grade  3,  convergent 
coefficients  for  the  SAT  and  SOI  were  .72,  .58,  and  .52  in  the  reading,  math,  and 
language  domains,  respectively.  In  general,  convergent  coefficients  tended  to 
be  higher  than  discriminant  coefficients  with  the  most  noteworthy  exception 
occurring  in  the  mono-method  discriminant  coefficients  for  the  language  area, 
particularly  for  whites,  both  gender  groups,  and  all  socioeconomic  groups. 

For the total sample at grade 4, convergent coefficients for the SAT and SOI
were .61, .54, and .55 in the reading, math, and language domains,
respectively; the two convergent coefficients involving the Florida Writing
Assessment were .54 and .53. Convergent coefficients
tended to be higher than or nearly as high as discriminant coefficients, with
notable exceptions similar to those at the third-grade level.
Table 24

Intercorrelations for the SAT, SOI, and Florida Writes for the Total Third-,
Fourth-, and Fifth-Grade Samples

                A                             B                             C
                SAT                           SOI                           Fl.Wr.
                Reading   Math      Language  Reading   Math      Language  Writing

Total 3rd Grade Sample (n = 504)
A SAT
  Reading       --        .69       .74       .72       .47       .52
  Math          .72       --        .69       .54       .59       .46
  Language      .75       .74       --        .68       .54       .53
B SOI
  Reading       .72       .62       .71       --        .51       .62
  Math          .45       .58       .54       .59       --        .64
  Language      .49       .51       .52       .66       .71       --

Total 4th Grade Sample (n = 489)
A SAT
  Reading       --        .68       .73       .65       .57       .58       .53
  Math          .72       --        .68       .55       .60       .48       .47
  Language      .77       .71       --        .61       .59       .58       .50
B SOI
  Reading       .61       .50       .60       --        .53       .67       .51
  Math          .53       .54       .56       .60       --        .65       .40
  Language      .52       .42       .55       .70       .70       --        [illeg.]
C Fl. Writing   .57       .51       [illeg.]  .51       .44       [illeg.]  --

Total 5th Grade Sample (n = 640)
A SAT
  Reading       --        .64       .69       .56       .48       .52
  Math          .71       --        .69       .50       .57       .50
  Language      .74       .74       --        .51       .53       .57
B SOI
  Reading       .59       .53       .55       --        .53       .59
  Math          .42       .47       .43       .51       --        .62
  Language      .48       .44       .50       .60       .68       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
Table 25

Intercorrelations for the SAT and SOI for Racial/Ethnic Groups at Third Grade

                A                             B
                SAT                           SOI
                Reading   Math      Language  Reading   Math      Language

Whites (n = 409)
A SAT
  Reading       --        .68       .75       .73       .48       .53
  Math          .73       --        .69       .55       .60       .46
  Language      .77       .74       --        .68       .54       [illeg.]
B SOI
  Reading       .73       .63       .72       --        .52       .62
  Math          .47       .60       .54       .62       --        .65
  Language      .53       .52       [illeg.]  .68       .72       --

Blacks (n = 40)
A SAT
  Reading       --        .77       .72       .66       .24       .49
  Math          .78       --        .69       .50       .49       .55
  Language      .70       .79       --        .79       .53       [illeg.]
B SOI
  Reading       .71       .56       .66       --        .27       .49
  Math          .42       .59       .69       .53       --        .63
  Language      .45       .51       [illeg.]  .56       .76       --

Hispanics (n = 55)
A SAT
  Reading       --        .68       .64       .71       .56       .44
  Math          .55       --        .70       .57       .63       .42
  Language      .47       .67       --        .63       .60       [illeg.]
B SOI
  Reading       .64       .57       .59       --        .62       .69
  Math          .21       .43       .41       .34       --        .50
  Language      .18       .36       .38       .59       .52       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
Table 26

Intercorrelations for the SAT, SOI, and Florida Writes for Racial/Ethnic Groups at
Fourth Grade

                A                             B                             C
                SAT                           SOI                           Fl.Wr.
                Reading   Math      Language  Reading   Math      Language  Writing

Whites (n = 392)
A SAT
  Reading       --        .68       .72       .63       .54       .55       .51
  Math          .73       --        .68       .56       .58       .46       .45
  Language      .77       .72       --        .60       .57       [illeg.]  [illeg.]
B SOI
  Reading       .60       .52       .61       --        .51       .66       .49
  Math          .52       .53       .57       .59       --        .62       .37
  Language      .51       .40       [illeg.]  .69       .67       --        [illeg.]
C Fl. Writing   .56       .50       .53       .51       .42       .51       --

Blacks (n = 51)
A SAT
  Reading       --        .67       .77       .71       .73       .68       .49
  Math          .61       --        .76       .63       .70       .64       .58
  Language      .73       .69       --        .62       .78       [illeg.]  [illeg.]
B SOI
  Reading       .57       .42       .52       --        .70       .69       .46
  Math          .59       .60       .64       .74       --        .76       .48
  Language      .60       .50       [illeg.]  .72       .80       --        [illeg.]
C Fl. Writing   .51       .63       .57       .36       .48       .57       --

Hispanics (n = [illeg.])
A SAT
  Reading       --        .70       .77       .77       .69       .69       .75
  Math          .69       --        .62       .41       .66       .46       .57
  Language      .74       .59       --        .77       .56       [illeg.]  [illeg.]
B SOI
  Reading       .72       .45       .67       --        .55       .71       .72
  Math          .66       .66       .45       .58       --        .75       .59
  Language      .62       .45       [illeg.]  .71       .77       --        [illeg.]
C Fl. Writing   .77       .54       .62       .68       .60       .60       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
Table 27

Intercorrelations for the SAT and SOI for Racial/Ethnic Groups at Fifth Grade

                A                             B
                SAT                           SOI
                Reading   Math      Language  Reading   Math      Language

Whites (n = 493)
A SAT
  Reading       --        .63       .68       .56       .49       .53
  Math          .69       --        .68       .50       .56       .48
  Language      .72       .73       --        .50       .52       [illeg.]
B SOI
  Reading       .56       .50       .52       --        .51       .58
  Math          .47       .50       .49       .56       --        .62
  Language      .50       .44       [illeg.]  .62       .70       --

Blacks (n = 66)
A SAT
  Reading       --        .64       .70       .44       .38       .41
  Math          .67       --        .74       .50       .55       .45
  Language      .76       .72       --        .56       .57       [illeg.]
B SOI
  Reading       .52       .51       .61       --        .63       .59
  Math          .42       .52       .48       .54       --        .70
  Language      .44       .47       [illeg.]  .61       .70       --

Hispanics (n = 81)
A SAT
  Reading       --        .63       .74       .63       .54       .55
  Math          .70       --        .66       .47       .58       .61
  Language      .75       .63       --        .50       .51       [illeg.]
B SOI
  Reading       .67       .49       .56       --        .52       .63
  Math          .36       .49       .26       .42       --        .53
  Language      .46       .50       .46       .60       .55       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
Table 28

Intercorrelations for the SAT and SOI for Gender Groups at Third Grade

                A                             B
                SAT                           SOI
                Reading   Math      Language  Reading   Math      Language

Females (n = 225)
A SAT
  Reading       --        .71       .74       .70       .44       .51
  Math          .75       --        .69       .55       .55       .50
  Language      .76       .75       --        .70       .55       [illeg.]
B SOI
  Reading       .69       .61       .70       --        .52       .60
  Math          .45       .60       .57       .65       --        .67
  Language      .48       .52       [illeg.]  .66       .75       --

Males (n = 279)
A SAT
  Reading       --        .68       .73       .73       .50       .52
  Math          .69       --        .70       .54       .61       .44
  Language      .74       .74       --        .67       .54       [illeg.]
B SOI
  Reading       .75       .62       .71       --        .51       .64
  Math          .46       .57       .54       .56       --        .63
  Language      .50       .49       .52       .66       .69       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
Table 29

Intercorrelations for the SAT, SOI, and Florida Writes for Gender Groups at
Fourth Grade

                A                             B                             C
                SAT                           SOI                           Fl.Wr.
                Reading   Math      Language  Reading   Math      Language  Writing

Females (n = 256)
A SAT
  Reading       --        .69       .73       .65       .57       .55       .49
  Math          .74       --        .68       .49       .58       .41       .43
  Language      .77       .71       --        .57       .58       [illeg.]  [illeg.]
B SOI
  Reading       .59       .45       .55       --        .51       .64       .47
  Math          .52       .53       .54       .61       --        .61       .29
  Language      .48       .36       [illeg.]  .67       .67       --        .40
C Fl. Writing   .55       .52       .47       .47       .35       .44       --

Males (n = 233)
A SAT
  Reading       --        .69       .73       .65       .58       .60       .56
  Math          .70       --        .70       .61       .61       .55       .53
  Language      .76       .72       --        .66       .61       [illeg.]  [illeg.]
B SOI
  Reading       .63       .56       .64       --        .56       .69       .54
  Math          .54       .56       .58       .60       --        .69       .51
  Language      .54       .48       [illeg.]  .71       .72       --        [illeg.]
C Fl. Writing   .58       .52       .58       .53       .52       .58       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
Table 30

Intercorrelations for the SAT and SOI for Gender Groups at Fifth Grade

                A                             B
                SAT                           SOI
                Reading   Math      Language  Reading   Math      Language

Females (n = 316)
A SAT
  Reading       --        .64       .71       .61       .49       .56
  Math          .67       --        .73       .54       .58       .54
  Language      .73       .75       --        .54       .52       [illeg.]
B SOI
  Reading       .60       .53       .57       --        .52       .60
  Math          .42       .48       .44       .49       --        .63
  Language      .50       .46       [illeg.]  .59       .67       --

Males (n = 324)
A SAT
  Reading       --        .66       .66       .52       .47       .46
  Math          .75       --        .67       .48       .55       .47
  Language      .74       .75       --        .48       .53       [illeg.]
B SOI
  Reading       .57       .54       .53       --        .53       .59
  Math          .42       .45       .41       .53       --        .63
  Language      .44       .43       .45       .60       .69       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
Table 31

Intercorrelations for the SAT and SOI for Socioeconomic Groups at Third Grade

                A                             B
                SAT                           SOI
                Reading   Math      Language  Reading   Math      Language

Not Eligible (n = 143)
A SAT
  Reading       --        .72       .75       .73       .47       .57
  Math          .76       --        .67       .57       .58       .54
  Language      .76       .72       --        .69       .47       [illeg.]
B SOI
  Reading       .72       .61       .68       --        .49       .65
  Math          .35       .52       .42       .57       --        .59
  Language      .51       .53       .49       .69       .65       --

Free (n = 330)
A SAT
  Reading       --        .68       .74       .73       .49       .50
  Math          .69       --        .69       .54       .59       .43
  Language      .73       .73       --        .70       .56       [illeg.]
B SOI
  Reading       .71       .61       .71       --        .53       .61
  Math          .48       .61       .59       .60       --        .65
  Language      .47       .50       [illeg.]  .65       .72       --

Reduced (n = [illeg.])
A SAT
  Reading       --        .43       .55       .51       .30       .36
  Math          .56       --        .72       .48       .63       .42
  Language      .72       .76       --        .54       .60       [illeg.]
B SOI
  Reading       .61       .48       .55       --        .36       .57
  Math          .45       .55       .52       .50       --        .74
  Language      .41       .34       .43       .56       .84       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
Table 32

Intercorrelations for the SAT, SOI, and Florida Writes for Socioeconomic Groups
at Fourth Grade

                A                             B                             C
                SAT                           SOI                           Fl.Wr.
                Reading   Math      Language  Reading   Math      Language  Writing

Not Eligible (n = 173)
A SAT
  Reading       --        .72       .76       .67       .62       .58       .54
  Math          .77       --        .70       .58       .63       .49       .53
  Language      .81       .75       --        .63       .63       [illeg.]  [illeg.]
B SOI
  Reading       .63       .51       .61       --        .51       .70       .53
  Math          .61       .57       .63       .60       --        .66       .44
  Language      .52       .43       [illeg.]  .72       .71       --        [illeg.]
C Fl. Writing   .59       .58       .58       .52       .52       .55       --

Free (n = 277)
A SAT
  Reading       --        .64       .69       .62       .55       .57       .50
  Math          .66       --        .65       .51       .57       .45       .43
  Language      .72       .67       --        .60       .56       [illeg.]  [illeg.]
B SOI
  Reading       .57       .47       .59       --        .56       .65       .50
  Math          .47       .52       .52       .61       --        .65       .38
  Language      .52       .42       [illeg.]  .69       .70       --        [illeg.]
C Fl. Writing   .52       .45       .47       .49       .38       .51       --

Reduced (n = 39)
A SAT
  Reading       --        .64       .73       .62       .36       .43       .63
  Math          .71       --        .65       .55       .52       .47       .40
  Language      .75       .68       --        .51       .39       [illeg.]  [illeg.]
B SOI
  Reading       .71       .60       .60       --        .34       .54       .36
  Math          .46       .57       .42       .54       --        .46       .24
  Language      .36       .25       [illeg.]  .65       .42       --        [illeg.]
C Fl. Writing   .73       .47       .67       .53       .28       .37       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
Table 33

Intercorrelations for the SAT and SOI for Socioeconomic Groups at Fifth Grade

                A                             B
                SAT                           SOI
                Reading   Math      Language  Reading   Math      Language

Not Eligible (n = 175)
A SAT
  Reading       --        .60       .68       .60       .48       .53
  Math          .68       --        .63       .44       .56       .49
  Language      .74       .69       --        .45       .50       [illeg.]
B SOI
  Reading       .63       .46       .48       --        .39       .54
  Math          .47       .44       .41       .45       --        .59
  Language      .54       .44       .50       .59       .69       --

Free (n = 394)
A SAT
  Reading       --        .67       .70       .55       .47       .51
  Math          .72       --        .70       .52       .54       .49
  Language      .74       .74       --        .54       .51       [illeg.]
B SOI
  Reading       .58       .55       .57       --        .56       .62
  Math          .40       .46       .41       .53       --        .63
  Language      .44       .42       [illeg.]  .60       .67       --

Reduced (n = 71)
A SAT
  Reading       --        .51       .61       .52       .52       .52
  Math          .55       --        .74       .41       .67       .49
  Language      .67       .75       --        .42       .63       [illeg.]
B SOI
  Reading       .45       .39       .46       --        .58       .53
  Math          .42       .53       .52       .54       --        .65
  Language      .44       .41       .42       .57       .66       --

Note. The observed unadjusted correlations are found in the lower half of the table.
Correlations based upon scores adjusted for class means are found in the upper half of
the table.
For the total sample at grade 5, convergent coefficients for the SAT and
SOI were .59, .47, and .50 in the reading, math, and language domains,
respectively. The patterns of convergent coefficients and discriminant
coefficients tended to be similar to those for the other grade levels, with
mono-method discriminant coefficients in the language domain exhibiting some errant
patterns, particularly for Hispanic students.

Similar patterns were evident for the convergent coefficients based upon
scores adjusted for class means, which are found in the upper half of the tables.
Based on the similarity of the coefficients for adjusted scores and unadjusted
scores and the lack of a systematic effect of the adjustment, subsequent analyses in
this chapter will be reported only for the unadjusted scores.

Research question five addressed the pattern of convergent and
discriminant validity coefficients between the Stanford, the SOI, and the Florida
Writing Assessment. Specific statistical hypotheses were formulated to provide
criteria for addressing the research question. Let $\rho_{soir,satr}$ represent the
correlation between the SOI and the Stanford in the reading domain.
Comparable notation will represent the domain areas of math (m) and language
(l). Let $\rho_{soil,flwr}$ represent the correlation between the SOI language domain and
the Florida Writing Assessment. Eight families of statistical tests were defined to
answer the fifth research question. Families 1 through 3 each consisted of four
a priori hypotheses and families 4 through 8 each consisted of two a priori
hypotheses. They were as follows:

H1a:  $\rho_{soir,satr} > \rho_{soir,satm}$
H1b:  $\rho_{soir,satr} > \rho_{soir,satl}$
H1c:  $\rho_{soir,satr} > \rho_{satr,soim}$
H1d:  $\rho_{soir,satr} > \rho_{satr,soil}$

H2a:  $\rho_{soim,satm} > \rho_{soim,satr}$
H2b:  $\rho_{soim,satm} > \rho_{soim,satl}$
H2c:  $\rho_{soim,satm} > \rho_{satm,soir}$
H2d:  $\rho_{soim,satm} > \rho_{satm,soil}$

H3a:  $\rho_{soil,satl} > \rho_{soil,satr}$
H3b:  $\rho_{soil,satl} > \rho_{soil,satm}$
H3c:  $\rho_{soil,satl} > \rho_{satl,soir}$
H3d:  $\rho_{soil,satl} > \rho_{satl,soim}$

H4a:  $\rho_{soir,satr} > \rho_{soir,soim}$
H4b:  $\rho_{soir,satr} > \rho_{soir,soil}$

H5a:  $\rho_{soim,satm} > \rho_{soim,soir}$
H5b:  $\rho_{soim,satm} > \rho_{soim,soil}$

H6a:  $\rho_{soil,satl} > \rho_{soil,soir}$
H6b:  $\rho_{soil,satl} > \rho_{soil,soim}$

For fourth grade only:

H7a:  $\rho_{soil,flwr} > \rho_{soim,flwr}$
H7b:  $\rho_{soil,flwr} > \rho_{soir,flwr}$

H8a:  $\rho_{soil,flwr} > \rho_{satm,flwr}$
H8b:  $\rho_{soil,flwr} > \rho_{satr,flwr}$


Hypotheses 1-3 contrast convergent validity coefficients against hetero-trait,
hetero-method coefficients. Hypotheses 4-6 contrast convergent coefficients
against mono-method coefficients. Each family of hypotheses was tested for
each grade level and subgroup. According to Steiger (1980), the most efficient
statistical test of two dependent correlations within a correlation matrix that
share a common variable is

$$Z^* = \frac{(z_{jk} - z_{jh})\sqrt{N - 3}}{\sqrt{2 - 2\,s_{jk,jh}}},$$

where $z_{jk}$ and $z_{jh}$ are Fisher z transformations of values taken from the
correlation matrix, $N$ is the sample size, and $s_{jk,jh}$ is the asymptotic
covariance of $r_{jk}$ and $r_{jh}$ (Steiger, 1980). The nominal Type I error for each
hypothesis was .025 and thus .10 for
families 1 through 3 and .05 for families 4 through 8.
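
The test of two dependent correlations described above can be sketched in a few lines. This is a minimal illustration that assumes Steiger's (1980) pooled-correlation form of the covariance term; the function name is hypothetical, and the example values are the unadjusted grade 3 coefficients from Table 24 (SOI reading with SAT reading, SOI reading with SAT math, and SAT reading with SAT math), used here as a plausible instance of H1a rather than as a reproduction of the reported results.

```python
import math


def steiger_z(r_jk, r_jh, r_kh, n):
    """Z* for H: rho_jk > rho_jh, where both correlations share variable j.

    r_kh is the correlation between the two non-shared variables; n is the
    sample size. (Hypothetical helper; pooled-r form assumed.)
    """
    z_jk, z_jh = math.atanh(r_jk), math.atanh(r_jh)  # Fisher z transformations
    r_bar = (r_jk + r_jh) / 2.0                      # pooled correlation
    # Covariance term with the pooled r substituted for r_jk and r_jh.
    psi = (r_kh * (1 - 2 * r_bar ** 2)
           - 0.5 * r_bar ** 2 * (1 - 2 * r_bar ** 2 - r_kh ** 2))
    s_bar = psi / (1 - r_bar ** 2) ** 2
    return (z_jk - z_jh) * math.sqrt(n - 3) / math.sqrt(2 - 2 * s_bar)


# Grade 3 total sample (n = 504): SOI reading with SAT reading (.72),
# SOI reading with SAT math (.62), SAT reading with SAT math (.72).
z_star = steiger_z(0.72, 0.62, 0.72, 504)
print(round(z_star, 2))
```

With a one-tailed nominal Type I error of .025 per hypothesis, Z* would be compared against a normal critical value of about 1.96; with four such tests in a family, the familywise rate is bounded by 4 x .025 = .10, which matches the rates stated above.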

Summaries of the significance of the test statistics are reported in Tables
34-38. Although no consistent pattern of relationships among the
coefficients was clearly defined, some noteworthy results were found. Although
there is limited evidence of construct validity for the SOI, the reading domain
had the strongest evidence of construct validity, particularly at the third grade,
as reported in Table 34. SOI Math displayed only modest evidence of construct
validity at grades 3 and 4. For SOI Language, the convergent validity coefficient
exceeded discriminant coefficients only once, at grade 4. Furthermore,
although grade 3 had the greatest number of positive findings, grade 4 had at
least one positive finding in each content area.

Table 34

Summary of Contrasts of Convergent Validity Coefficients with Discriminant Validity Coefficients Using Steiger's Z* for
Each Grade Level

Table 35

Summary of Contrasts of Convergent Validity Coefficients with Discriminant Validity Coefficients Using Steiger's Z* for
Grade 3 Racial/Ethnic, Gender, and Socioeconomic Groups

Table 36

Summary of Contrasts of Convergent Validity Coefficients with Discriminant Validity Coefficients Using Steiger's Z* for
Grade 4 Racial/Ethnic, Gender, and Socioeconomic Groups

Table 37

Summary of Contrasts of Convergent Validity Coefficients with Discriminant Validity Coefficients Using Steiger's Z* for
Grade 5 Racial/Ethnic, Gender, and Socioeconomic Groups

Table 38

Summary of Contrasts of Convergent Validity Coefficients with Discriminant
Validity Coefficients Using Steiger's Z* for Grade 4 Florida Writing Assessment
for Subgroups

Subgroup  FLWR-SOI  Disc. Coeff. with SOI  FLWR-SAT  Disc. Coeff. with SAT
Conv. Validity  R M L  Conv. Validity  R M L

Whites     .51  +  NA  .53  NA
Blacks     .57  NA  .57  NA
Hispanics  .60  NA  .62  NA
Females    .44  NA  .47  NA
Males      .58  NA  .58  NA
Not Elig.  .55  NA  .58  NA
Free       .51  +  NA  .47  NA
Reduced    .37  NA  .67  +  NA

Note. NA designates "not applicable to this trait"; + designates the convergent validity coefficient
was significantly greater than the discriminant coefficient (p < .025).

For reader interest, results of the convergent and discriminant validity
comparisons are reported for racial/ethnic, gender, and socioeconomic
subgroups in Tables 35 through 38. However, it is inappropriate to attempt to
make direct comparisons of patterns across subgroups based on this analysis.

Factor  Analysis 

To  explore  the  issue  of  the  construct  validity  of  the  SOI,  factor  analysis 
was  used.  A separate  factor  analysis  was  conducted  for  each  grade  level. 
Initially,  an  exploratory  principal  axis  factor  analysis,  in  which  communalities 
were  used  in  the  diagonals  of  the  correlation  matrix,  was  conducted. 
Eigenvalues  and  percent  of  variance  explained  for  Grade  3 are  presented  in 
Table  39. 

The results revealed that for grade 3 as a whole, a two-factor model
would explain 81.4% of the variance, with an eigenvalue of the first factor of
4.10935 and for the second factor of .77707. Because only one eigenvalue was
greater than 1.00, this initial exploratory factor analysis produced communalities
and factor loadings based upon a one-factor model. Communalities and factor
loadings for this model are reported in Table 40.

Table 39

Exploratory Principal Axis Factor Analysis Eigenvalues and Percent of Variance
Explained for Grade 3

Factor    Eigenvalue    Percent of Variance    Cumulative Percent
1         4.10935       68.5                   68.5
2         .77707        13.0                   81.4
3         .42257        7.0                    88.5
4         .25561        4.3                    92.7
5         .23521        3.9                    96.7
6         .20015        3.3                    100.0
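
The percent-of-variance and cumulative-percent columns follow directly from the eigenvalues, and the one-factor decision follows from counting eigenvalues above 1.00. A minimal sketch, using the grade 3 eigenvalues from Table 39 (the function name is hypothetical):

```python
def variance_explained(eigenvalues):
    """Return (factor, eigenvalue, percent, cumulative percent) rows.

    Each eigenvalue is expressed as a share of the eigenvalue total, which
    equals the number of variables for a correlation matrix.
    """
    total = sum(eigenvalues)
    rows, cumulative = [], 0.0
    for factor, value in enumerate(eigenvalues, start=1):
        percent = 100.0 * value / total
        cumulative += percent
        rows.append((factor, value, round(percent, 1), round(cumulative, 1)))
    return rows


grade3 = [4.10935, 0.77707, 0.42257, 0.25561, 0.23521, 0.20015]
rows = variance_explained(grade3)

# Eigenvalue-greater-than-one rule: number of factors retained.
n_retained = sum(1 for value in grade3 if value > 1.0)

for row in rows:
    print(row)
print("factors with eigenvalue > 1.00:", n_retained)
```

The six eigenvalues sum to approximately 6, the number of subtests entered, so each percentage is essentially the eigenvalue divided by 6.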

The factor loadings appear to be large, with the factor loadings highest
for the SOI reading and loadings for SAT reading, language, and math
subtests all greater than .80. The scree plot (Figure 7) illustrates the relative
magnitudes of eigenvalues with the first one of 4.10935, the second eigenvalue
close to 1.00 (.77707), and the remaining four values substantially less than
1.00.

Table 40

Communalities and Factor Loadings for the SOI and SAT-Total Grade 3
Sample

Instrument    Subtest     Communality    Factor 1
SOI           Reading     .71873         .84778
              Language    .50699         .71203
              Math        .50369         .70971
SAT           Reading     .64961         .80599
              Language    .71091         .84316
              Math        .65357         .80844
Figure 7. Scree plot of eigenvalues for all grade 3 students (horizontal axis: Factor Number).

To explore the possibility of a weak method factor, a second analysis
was conducted using principal axis factor analysis with a two-factor
specification. A varimax rotation, which converged in three iterations, produced
the factor loading estimates reported in Table 41. Factor one is primarily
correlated with the SAT reading, language, and math subtests. In addition, the
SOI reading subtest loads on the first factor but also loads substantially on the
second. Factor two is primarily correlated with the SOI language and math
subtests. The dual loading of SOI reading on both factors may be partially
attributable to the fact that the SOI reading score is based largely upon the
instructional reading level, which is obtained through structured running
records. This process of obtaining the SOI reading score more closely
resembles the procedures used in the standardized SAT than those used in the SOI
math and language areas. Communality estimates ranged from .67069 for the
SAT math to .80136 for the SAT reading. The communality estimates indicate
that relatively large proportions of the variation in the subtests are shared across
the two factors. Estimates obtained from an oblique (promax) solution
followed a similar pattern but did not improve interpretability.


Table 41

Grade Sample

Instrument    Subtest     Communality    Factor 1    Factor 2
SOI           Reading     .69355         .64275      .52956
              Language    .76741         .30223      .82223
              Math        .66478         .34155      .74035
SAT           Reading     .79312         .85075      .26333
              Language    .76486         .79850      .35672
              Math        .66709         .72044      .38479
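
For an orthogonal (varimax) solution, each communality is the sum of the squared loadings of that subtest across the retained factors. The short sketch below checks this relationship against the tabled grade 3 values; the function name is hypothetical and the loadings are copied from Tables 40 and 41.

```python
def communality(loadings):
    """Sum of squared loadings across orthogonal factors for one subtest."""
    return sum(value ** 2 for value in loadings)


# One-factor model (Table 40): SOI reading loading .84778
one_factor = communality([0.84778])

# Two-factor varimax model (Table 41): SOI reading and SAT reading loadings
soi_reading = communality([0.64275, 0.52956])
sat_reading = communality([0.85075, 0.26333])

print(round(one_factor, 5), round(soi_reading, 5), round(sat_reading, 5))
```

The computed values agree with the tabled communalities to within rounding in the last reported digit, which is a quick way to confirm that a row of loadings has been read correctly.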

A similar  procedure  was  followed  for  grade  4.  Eigenvalues  and 
percent  of  variance  explained  for  the  total  grade  4 sample  are  presented  in 
Table  42. 

The results revealed that for grade 4 as a whole, a two-factor model
would explain 75.5% of the variance, with an eigenvalue of the first factor of
4.47534 and for the second factor of .81096. Because only one eigenvalue was
greater than 1.00, this initial exploratory factor analysis produced communalities
and factor loadings based upon a one-factor model. Communalities and factor
loadings for this model are reported in Table 43.

Table 42

Exploratory Principal Axis Factor Analysis Eigenvalues and Percent of Variance
Explained for Grade 4

Factor    Eigenvalue    Percent of Variance    Cumulative Percent
1         4.47534       63.9                   63.9
2         .81096        11.6                   75.5
3         .56151        8.0                    83.5
4         .41240        5.9                    89.4
5         .29061        4.2                    93.6
6         .22892        3.3                    96.9
7         .22026        3.1                    100.0

The factor loadings appear to be large. The loading was highest for the
SAT language subtest, and the loadings for both the SAT reading and SAT
language subtests were greater than .80. The communality estimate for the
Florida Writing Assessment was the lowest at .44377. The scree plot (Figure 8)
illustrates the relative magnitudes of the eigenvalues, with the first at
4.47534, the second close to 1.00 (.81096), and the remaining five
substantially less than 1.00.
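
The percent-of-variance figures and the one-factor decision can be checked
directly from the reported eigenvalues. The following is a minimal Python
sketch, assuming the Table 42 eigenvalues come from a correlation matrix of the
seven subtests (they sum to approximately 7.0). The function names are
illustrative and are not part of the original analysis.

```python
# Eigenvalues as reported in Table 42 (grade 4, seven subtests).
EIGENVALUES = [4.47534, 0.81096, 0.56151, 0.41240, 0.29061, 0.22892, 0.22026]


def variance_explained(eigenvalues):
    """Return (percent, cumulative percent) lists for a set of eigenvalues."""
    total = sum(eigenvalues)  # about 7.0 here, the number of subtests
    percents = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in percents:
        running += p
        cumulative.append(running)
    return percents, cumulative


def kaiser_retained(eigenvalues, threshold=1.0):
    """Count the factors whose eigenvalue exceeds the threshold."""
    return sum(1 for ev in eigenvalues if ev > threshold)


percents, cumulative = variance_explained(EIGENVALUES)
for k, (ev, p, c) in enumerate(zip(EIGENVALUES, percents, cumulative), start=1):
    print(f"{k}  {ev:.5f}  {p:5.1f}  {c:5.1f}")
print("factors with eigenvalue > 1.00:", kaiser_retained(EIGENVALUES))
```

Each percentage is the eigenvalue divided by the sum of all eigenvalues, so the
first two factors together account for the 75.5% cited in the text, while the
eigenvalue-greater-than-1.00 rule retains only the first.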

Table 43

4 Sample

Instrument   Subtest    Communality   Factor 1

SOI          Reading    .60042        .77487
             Language   .54521        .73838
             Math       .53715        .73290
SAT          Reading    .68827        .82962
             Language   .69417        .83317
             Math       .55718        .74645
FLWR         Writing    .44377        .66616

[Factor scree plot; horizontal axis: factor number]

Figure 8. Scree plot of eigenvalues for all grade 4 students.

As was done at grade 3, the possibility of the existence of a weak method
factor was explored in a second factor analysis, which was conducted using
principal axis factor analysis with a two-factor specification. A varimax
rotation, which converged in three iterations, produced the factor loading
estimates reported in Table 44. Factor one is primarily correlated with the SAT
reading, language, and math subtests. In addition, the Florida Writing
Assessment correlates most highly with the first factor, but the relationship is
relatively weak. Factor two is primarily correlated with the SOI reading,
language, and math subtests, with SOI language demonstrating the strongest
relationship. Communality estimates ranged from .43141 for the Florida Writing
Assessment to .95862 for SOI language. The communality estimates indicate that
relatively large proportions of variations in the subtests are shared across the
two factors. Estimates obtained from an oblique solution (promax) method
followed a similar pattern but did not improve interpretability.
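
For an orthogonal (varimax) solution, each communality equals the sum of the
squared loadings of that subtest on the retained factors. The sketch below, in
Python, checks this against the values reported in Table 44 and notes which
factor each subtest loads on most strongly. The data structure and function
names are illustrative assumptions, not the software used in the study.

```python
# (instrument, subtest): (communality, Factor 1 loading, Factor 2 loading),
# copied from Table 44 (grade 4, two-factor varimax solution).
TABLE_44 = {
    ("SOI", "Reading"): (0.61710, 0.47371, 0.62665),
    ("SOI", "Language"): (0.95862, 0.24068, 0.94905),
    ("SOI", "Math"): (0.56691, 0.42970, 0.61828),
    ("SAT", "Reading"): (0.77977, 0.81597, 0.33758),
    ("SAT", "Language"): (0.74455, 0.77658, 0.37613),
    ("SAT", "Math"): (0.68193, 0.78631, 0.25229),
    ("FLWR", "Writing"): (0.43141, 0.51377, 0.40920),
}


def communality(loadings):
    """Sum of squared loadings across orthogonal factors."""
    return sum(value * value for value in loadings)


def dominant_factor(loadings):
    """1-based index of the factor with the largest absolute loading."""
    return max(range(len(loadings)), key=lambda i: abs(loadings[i])) + 1


def max_discrepancy(table):
    """Largest gap between reported and recomputed communalities."""
    return max(abs(h2 - communality(rest)) for h2, *rest in table.values())


for (instrument, subtest), (h2, f1, f2) in TABLE_44.items():
    recomputed = communality((f1, f2))
    print(f"{instrument:5s}{subtest:9s} reported={h2:.5f} "
          f"recomputed={recomputed:.5f} dominant=Factor {dominant_factor((f1, f2))}")
```

The recomputed values agree with the reported communalities to within rounding
of the loadings, and the dominant-factor column reproduces the pattern described
in the text: the SAT subtests and the writing assessment on Factor 1, the SOI
subtests on Factor 2.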

The  analyses  for  grade  5 produced  results  similar  to  grades  3 and  4. 
Eigenvalues  and  percent  of  variance  explained  are  presented  in  Table  45. 

The results revealed that for grade 5 as a whole, a two-factor model
would explain 78.4% of the variance, with an eigenvalue of the first factor of
3.80676 and for the second factor of .89875. Because only one eigenvalue was
greater than 1.00, this initial exploratory factor analysis produced
communalities and factor loadings based upon a one-factor model. Communalities
and factor loadings for this model are reported in Table 46.

Table 44

Rotated Factor Loadings for the SOI, SAT, and FLWR Two Factor Solution-Total
Fourth Grade Sample

Instrument   Subtest    Communality   Factor 1   Factor 2

SOI          Reading    .61710        .47371     .62665
             Language   .95862        .24068     .94905
             Math       .56691        .42970     .61828
SAT          Reading    .77977        .81597     .33758
             Language   .74455        .77658     .37613
             Math       .68193        .78631     .25229
FLWR         Writing    .43141        .51377     .40920

Table 45

Exploratory Principal Axis Factor Analysis Eigenvalues and Percent of Variance
Explained for Grade 5

Factor   Eigenvalue   Percent of Variance   Cumulative Percent

1        3.80676      63.4                   63.4
2         .89875      15.0                   78.4
3         .46814       7.8                   86.2
4         .33105       5.5                   91.7
5         .26837       4.5                   96.2
6         .22694       3.8                  100.0

Table 46

Communalities and Factor Loadings for the SOI and SAT--Total Grade 5
Sample

Instrument   Subtest    Communality   Factor 1

SOI          Reading    .54594        .73888
             Language   .48722        .69801
             Math       .41457        .64387
SAT          Reading    .65301        .80809
             Language   .66380        .81474
             Math       .61795        .78610

The factor loadings appear to be large, with the highest loadings for the
SAT reading and language subtests, both of which were greater than .80. The
scree plot (Figure 9) illustrates the relative magnitudes of the eigenvalues,
with the first at 3.80676, the second close to 1.00 (.89875), and the
remaining four substantially less than 1.00.

A second  factor  analysis  was  conducted  using  principal  axis  factor 
analysis  with  a two-factor  specification.  A varimax  rotation,  which  converged  in 
three  iterations,  produced  the  factor  loading  estimates  reported  in  Table  47. 
Factor  one  is  primarily  correlated  with  the  SAT  reading,  language,  and  math 
subtests.  Factor  two  is  primarily  correlated  with  the  SOI  reading,  language,  and 
math  subtests.  However,  as  was  evident  at  grade  3,  the  strength  of  the 
relationship with SOI reading was similar for both factors. Communality
estimates ranged from .53658 for SOI reading to .83097 for SOI language.
Estimates obtained from an oblique solution (promax) method followed a similar
pattern but did not improve interpretability.

[Factor scree plot; horizontal axis: factor number]

Figure 9. Scree plot of eigenvalues for all grade 5 students.

Table 47

Rotated Factor Loadings for the SOI and SAT Two Factor Solution-Total Fifth
Grade Sample

Instrument   Subtest    Communality   Factor 1   Factor 2

SOI          Reading    .53658        .49242     .54231
             Language   .83097        .26071     .87350
             Math       .55710        .28871     .68829
SAT          Reading    .73541        .80033     .30802
             Language   .75761        .81483     .30605
             Math       .69940        .78291     .29401

In  summary,  by  conventional  standards,  an  argument  could  be  made  for 
the  existence  of  a single  factor  for  these  data.  At  each  grade  level,  only  one 
factor with an eigenvalue greater than 1.00 was present. However, the results
of  the  two-factor  solution  were  presented  in  order  to  explore  the  possibility  that 
two  method-related  factors  exist.  Only  a weak  argument  could  be  made  for  the 
existence  of  two  factors. 


CHAPTER  5 

DISCUSSION  AND  IMPLICATIONS  FOR  FURTHER  RESEARCH 
This  study  was  designed  to  investigate  the  consistency  of  student 
performance  across  different  types  of  measurement  methods  used  in  a Title  I 
program  in  a middle-sized  southeastern  rural/suburban  school  district.  The 
year  in  which  this  study  occurred  coincided  with  the  implementation  of  the  1994 
reauthorization  of  Title  I of  the  Elementary  and  Secondary  Education  Act.  The 
grade  3-5  students  who  were  enrolled  in  the  district’s  Title  I Schoolwide 
programs  served  as  subjects.  Data  on  a standardized  achievement  measure,  a 
portfolio-based  instrument,  and  the  state’s  writing  performance  assessment 
were  collected.  The  generalizability  of  findings  was  examined  for 
subpopulations  defined  by  grade  level,  ethnic  group,  gender,  and 
socioeconomic level. The context for the study was chosen because of the
strong support for the use of multiple forms of assessment in evaluating
educational programs, particularly Title I, expressed by Feuer (1994),
LeTendre (1991), Jenkins (1993), and Shepard (1992b), among others, around the
time of the 1994 reauthorization of Title I of the Elementary and Secondary
Education Act. If the consequences for individual students differ depending on
the type of
instrument  used,  then  this  may  raise  questions  in  terms  of  the  legal  issues  of 
equity  and  equal  access  to  services.  Furthermore,  if  decisions  about  the 
effectiveness of school Title I programs vary as a function of the assessment
method, this could have important implications for Title I resource allocations,
instructional  strategies,  program  improvement,  student  selection  into 
supplemental  Title  I programs,  and  the  way  in  which  Title  I programs  are  to  be 
evaluated  at  school  and  district  levels.  Within  this  context,  the  findings  of  this 
study  would  inform  policy  makers  regarding  the  impact  of  varying  assessment 
methods  on  evaluation  results. 

Summary  of  Findings 

The  findings  of  the  study  indicate  that  measurement  method  has  an 
impact  on  the  outcome  of  the  evaluation  results  for  individual  students.  While 
the  classroom-based,  portfolio-based  assessment  yielded  the  greatest  number 
of  students  meeting  the  criteria  for  proficiency,  the  state’s  performance 
assessment  in  writing  at  grade  4 yielded  the  lowest  percentage  of  students 
meeting  the  proficiency  criteria.  More  students  were  classified  as  proficient  on 
the  portfolio  assessment  than  on  the  standardized  achievement  test  as  well. 
This  pattern  of  differential  proportions  of  students  being  classified  as  proficient 
as  a function  of  measurement  method  was  generally  consistent  across  all  of  the 
subgroups  examined  (grade  level,  racial/ethnic,  gender,  and  socioeconomic 
level). 

The  greatest  proportion  of  students  were  classified  as  proficient  on  the 
SOI,  with  values  in  the  reading  area  ranging  from  57%  to  71%  across  the  grade 
levels,  values  in  the  mathematics  area  ranging  from  54%  to  72%,  and  values  in 
the  language  area  ranging  from  52%  to  74%  (see  Tables  21-23).  The 
percentages  of  students  classified  as  proficient  in  the  corresponding  subject 
areas on the Stanford were considerably lower, ranging from 31.2% to 37.2% in
reading, 39.7% to 48.7% in mathematics, and 30.2% to 37.8% in language arts.
The discrepancies among measurement methods become even greater when
performance on the Florida Writing Assessment is considered. The percentage
of grade 4 students rated as proficient is 67.6 on the SOI, 37.8 on the SAT,
and 22.5 on the Florida Writing Assessment.

At  least  three  possible  explanations  arise.  One  is  that  the  district  and  the 
state  Title  I program  office  have  different  standards  for  "Proficiency."  Another 
possible  explanation  is  that  teachers  see  greater  evidence  of  proficiency  in 
portfolio  artifacts  than  children  are  able  to  display  on  more  objective  forms  of 
assessment.  A third  possibility  is  that  the  different  standard-setting  methods 
used  for  the  different  measurement  methods  may  lead  to  discrepancies  in 
proportions  classified  as  proficient. 

Internal  Consistency  of  the  Portfolio-Based  Teacher  Ratings  of 
Student  Performance 

One  important  aspect  of  construct  validity  identified  by  Messick  (1989, 
1995)  includes  the  extent  to  which  score  properties  generalize  across 
population  groups.  To  determine  the  internal  consistency  of  scored  features  of 
the  SOI  for  various  subgroups,  internal  consistency  was  estimated  for  each  of 
the  three  subject  areas  for  the  SOI.  Values  of  the  generalizability  coefficient 
were  generally  high  across  grade  levels  and  subgroups.  In  the  reading 
domain, values for total grades ranged from .91-.93, for race subpopulations
from .84-.94, for gender subpopulations from .91-.94, and for socioeconomic
subpopulations from .90-.95. In the math domain, values for each of the grade
levels were .96, for race subpopulations values ranged from .93-.97, for gender
subpopulations values ranged from .95-.97, and for socioeconomic
subpopulations values ranged from .95-.97. In the language domain, values for
total grades ranged from .92-.94, for race subpopulations from .90-.96, for
gender subpopulations from .92-.94, and for socioeconomic subpopulations
from .84-.95.

In general, values of the generalizability coefficients tended to be highest
in the mathematics area and lowest in the language area. These results seem
to indicate that internal consistency of scores is not impacted by subpopulation
membership. Such high estimates of internal consistency may indicate that the
scoring criteria are truly representative of the knowledge domain of interest.
Another possibility, however, is that halo effects influence teacher judgment
across all scored features of the portfolio. Finally, these coefficients are
overestimates of phi coefficients that would be associated with reliability of
scores used for making absolute decisions relative to a specific cut score.

Absolute  and  Relative  Standard  Errors  of  Measurement  of  the  Portfolio-Based 
Teacher  Ratings  of  Student  Performance 

In  order  to  determine  if  subgroup  membership  impacted  the  values  of 
absolute  and  relative  standard  errors  of  measurement  for  the  SOI,  these  were 
estimated  and  compared  across  groups.  As  would  be  expected,  standard 
errors  for  absolute  decisions  were  larger  than  for  relative  decisions  and  tended 
to  be  fairly  uniform  within  each  subject  area  across  subgroups.  Values  tended 
to be lowest in the math area, with standard errors of measurement for relative
decisions falling in the .09-.12 range and for absolute decisions in the .09-.13
range. Values tended to be highest in the reading domain, with standard errors
of measurement for relative decisions falling in the .15-.22 range, and for
absolute decisions in the .16-.22 range. These findings imply that confidence
intervals  for  a given  student’s  true  score  would  be  smallest  for  relative  decisions 
in  the  math  area  and  largest  for  absolute  decisions  in  the  reading  area.  It 
should  be  noted  that  the  number  of  items  affects  these  estimates,  with  math 
having  the  most  items  and  reading,  the  fewest.  The  similarity  of  these  errors  for 
various  groups  of  examinees  offers  no  evidence  to  indicate  that  errors  of 
measurement  are  more  prevalent  for  some  groups  than  others. 
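
The relationship among the generalizability coefficient, the phi coefficient,
and the two standard errors can be illustrated with a short Python sketch. It
assumes a simple crossed persons-by-scored-features design, which the text does
not state explicitly, and the variance components are hypothetical values
chosen only for illustration; they are not estimates from this study.

```python
import math
from dataclasses import dataclass


@dataclass
class VarianceComponents:
    """Hypothetical variance components for a persons x features design."""
    person: float    # true differences among students
    feature: float   # differences in difficulty among scored features
    residual: float  # person-by-feature interaction confounded with error


def d_study(vc: VarianceComponents, n_features: int) -> dict:
    """Coefficients and standard errors for a mean over n_features features."""
    relative_error = vc.residual / n_features
    absolute_error = (vc.feature + vc.residual) / n_features
    return {
        "g_coefficient": vc.person / (vc.person + relative_error),
        "phi": vc.person / (vc.person + absolute_error),
        "sem_relative": math.sqrt(relative_error),
        "sem_absolute": math.sqrt(absolute_error),
    }


# Illustrative values only; not taken from the study.
vc = VarianceComponents(person=0.50, feature=0.05, residual=0.40)
short_form = d_study(vc, n_features=5)
long_form = d_study(vc, n_features=15)

for label, result in (("5 features", short_form), ("15 features", long_form)):
    print(label, {name: round(value, 3) for name, value in result.items()})
```

Because the absolute error variance adds the feature-difficulty component to the
relative error variance, phi can never exceed the generalizability coefficient
and the absolute standard error can never fall below the relative one. Both
standard errors shrink as the number of scored features grows, which is
consistent with the note that the number of items affects these estimates.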

Proficiency  Classification 

A cross-tabulation  of  the  number  and  percentage  of  students  in  each  of 
the  subgroups  meeting  the  proficiency  criteria  was  conducted  for  each  pair  of 
measurement  methods.  Cohen’s  Kappa  was  calculated  as  an  estimate  of  the 
decision  consistency  across  measurement  methods,  beyond  that  level  expected 
by  chance.  The  consistency  of  proportions  of  students  classified  as  proficient  by 
pairs  of  methods  varied  substantially  across  groups. 
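
As an illustration of the index, the following Python sketch computes Cohen's
Kappa from a 2 x 2 cross-tabulation of proficiency decisions on two
measurement methods. The cell counts are hypothetical and are not drawn from
the study's tables; the function name is likewise illustrative.

```python
def cohens_kappa(both, first_only, second_only, neither):
    """Cohen's Kappa for two proficient/not-proficient classifications.

    both        -- students proficient on both methods
    first_only  -- proficient on the first method only
    second_only -- proficient on the second method only
    neither     -- proficient on neither method
    """
    n = both + first_only + second_only + neither
    if n == 0:
        raise ValueError("the cross-tabulation is empty")
    observed = (both + neither) / n
    p_first = (both + first_only) / n    # proportion proficient, method 1
    p_second = (both + second_only) / n  # proportion proficient, method 2
    expected = p_first * p_second + (1 - p_first) * (1 - p_second)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical counts: the first method classifies 65% as proficient,
# the second only 35%.
kappa = cohens_kappa(both=30, first_only=35, second_only=5, neither=30)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the two methods agree on 60% of students, but because
their marginal proficiency rates differ so sharply, agreement beyond chance is
modest. This is the mechanism by which differing classification rates can
depress Kappa.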

The proportion of students who could be classified consistently as
proficient across the pairs of measurement methods was lowered by the
differential proficiency classification rates of each of the assessments. While
all the Kappa values were relatively low, possibly due to the different points
on the ability distribution where the standards were set on the instruments or
to the relatively low convergent coefficients, there were some notable
discrepancies in Kappa for some subgroups. In language, for example, at grade
3, Kappa for SAT and SOI was .10 for Hispanics but .38 and .37 for blacks and
whites, respectively. At this grade more than half of the Hispanic students
(56.4%) were classified as proficient on the SOI, but only 14.5% were
classified as proficient on the SAT.

Convergent and Discriminant Validity Coefficients

Performance  on  the  SOI  was  positively  related  to  performance  on  the 
SAT  and  the  Florida  Writing  Assessment.  For  each  grade  level  as  a whole, 
convergent  validity  coefficients  tended  to  be  only  moderate,  with  the  strongest 
relationship  noted  in  the  reading  domain  at  grade  3 (r=.72)  and  the  weakest  in 
math  at  grade  5 (r=.47).  Using  Steiger’s  modified  z*,  the  significance  of  the 
convergent  validity  coefficients  was  tested  in  relation  to  the  discriminant 
coefficients  (Tables  34-38).  Although  patterns  of  the  relationship  among  the 
coefficients  were  not  clearly  defined,  some  noteworthy  results  were  found. 
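
A test of this kind compares two correlations that are computed on the same
students and share one variable, for example the correlation of SOI reading
with SAT reading against the correlation of SOI reading with SAT math. The
Python sketch below shows one commonly cited form of Steiger's (1980) modified
z for such dependent correlations. It is an assumption that this is the
variant applied in the study, and the input values in the example are
hypothetical.

```python
import math


def steiger_z(r12, r13, r23, n):
    """Compare dependent correlations r12 and r13 that share variable 1.

    r23 is the correlation between the two non-shared variables and n is
    the sample size. Returns (z, two-sided p-value).
    """
    if n <= 3:
        raise ValueError("n must exceed 3")
    r_bar_sq = ((r12 + r13) / 2.0) ** 2
    # Estimated covariance of the two Fisher-transformed correlations,
    # evaluated at the pooled correlation.
    psi = r23 * (1 - 2 * r_bar_sq) - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r23 ** 2)
    s = psi / (1 - r_bar_sq) ** 2
    z = (math.atanh(r12) - math.atanh(r13)) * math.sqrt((n - 3) / (2 - 2 * s))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical inputs: a convergent coefficient of .72, a discriminant
# coefficient of .55, and a correlation of .60 between the two SAT subtests.
z, p = steiger_z(r12=0.72, r13=0.55, r23=0.60, n=500)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A positive z indicates that the first (convergent) coefficient is the larger of
the two; the test accounts for the fact that both correlations come from the
same sample rather than treating them as independent.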

For  the  grade  level  data,  convergent  validity  coefficients  for  the  SOI 
reading  domain  were  significantly  different  from  the  discriminant  validity 
coefficients  more  often  than  for  the  other  academic  areas.  It  is  also  interesting 
to  note  that  mono-method  heterotrait  coefficients  for  the  SAT  frequently 
exceeded  the  SAT-SOI  convergent  validity  coefficients.  Coefficients  based 
upon  data  adjusted  for  class  mean  level  of  achievement  on  each  of  these 
assessments  yielded  similar  findings. 

Together, these nondistinct patterns of correlations suggested the
possibility of a single factor dominating performance across academic areas, as
well as the possibility of a measurement factor contributing to performance. An
exploratory factor analysis was conducted in order to further explore the
construct validity of the SOI. The presence of a one-factor model was
supported. However, results for the two-factor model were also presented. If
the presence of measurement-related factors could be confirmed in future
studies with larger samples, this would have important implications for design of
Title I program evaluations.

Limitations 

This  study  raises  issues  about  the  use  of  a single  measurement  method 
in  evaluating  Title  I programs  and  the  impact  of  varying  assessment  methods  on 
evaluation  findings;  however,  the  sample  included  students  from  only  one 
school  district  and  encompassed  three  measurement  methods.  More  extensive 
studies  involving  additional  districts  and  forms  of  these  measurement  methods 
would  provide  for  more  generalizable  conclusions. 

Although  the  performance  portfolio  upon  which  the  Student  Outcomes 
Instrument  is  based  contains  student  work  samples  in  the  areas  of  reading, 
writing/language,  and  mathematics,  the  level  of  specificity  regarding  data 
collection  is  far  more  advanced  in  the  first  two  academic  areas  than  in  the  latter 
area.  If  this  research  study  were  to  be  replicated  using  the  same 
portfolio-based  system,  it  would  be  important  to  enhance  the  specificity  of  the 
mathematics  portion  of  the  student  performance  portfolio.  Furthermore,  the 
portfolio  system  in  this  study  was  designed  for  use  by  classroom  teachers,  not 
by  external  reviewers.  When  external  ratings  were  compared  with  teacher 
ratings  for  a sample  of  50  portfolios,  the  interrater  reliability  was  weak, 
particularly  in  the  mathematics  area,  indicating  that  the  scores  based  on 
contents  of  the  portfolio  and  instructions  for  scoring  are  not  generalizable  to 
other  raters  outside  the  classroom. 

The scoring rule used to define proficiency on the SOI may also have
affected  the  results.  Specifically,  the  appropriateness  of  judging  a student's 
performance  as  proficient  based  on  a bare  majority  of  scored  features  may  be 
questionable.  Additional  attention  to  standard-setting  on  the  SOI  seems 
warranted. 

Implications  for  Further  Research 

This  type  of  study  should  be  replicated  in  settings  that  include  other  forms 
of  standardized  norm-referenced  tests,  portfolio  systems,  and 
performance-based  assessments.  If  evaluation  findings  vary  across  these 
measurement  methods,  this  could  have  important  implications  in  terms  of  the 
legal  issues  of  equity  and  equal  access  to  services,  Title  I resource  allocations, 
instructional  strategies,  program  improvement,  student  selection  into 
supplemental  Title  I programs,  and  the  way  in  which  Title  I programs  are  to  be 
evaluated  at  school  and  district  levels.  Factor  analytic  studies  of  items  within 
each  domain  might  yield  further  insight  into  the  construct  validity  of  portfolio 
assessments. 

Many  states  are  beginning  to  implement  statewide  performance-based 
and  portfolio-based  measurement  systems.  Replicating  this  study  using  such 
data  should  also  provide  illuminating  results.  This  study  provides  a model  for 
Title  I and  other  program  evaluators  to  use  in  assessing  the  consistency  of 
evaluation  findings  across  measurement  methods.  Although  the  specific 
measurement  methods  employed  in  this  study  may  differ  from  those  used  in 
other  districts,  similar  analyses  to  those  conducted  in  this  study  should  provide 
evidence  of  the  consistency  of  evaluation  findings  and  their  impact  on  the 
assessment methods chosen by districts and/or programs. Such information
could  inform  the  design  of  effective  and  equitable  program  evaluation  models. 

Implications  for  Practice 

Three  specific  implications  for  practice  that  emerge  from  this  study  are  as 
follows: 

1. If classroom portfolio systems are to be used as alternatives to
standardized  tests  or  performance  assessments  in  program  evaluations,  it  is 
important  to  have  consistent  standards  for  defining  levels  of  proficiency  across 
measurement  methods. 

2. At least some portion of classroom portfolio assessments should be
rated by external raters in addition to the classroom teacher to ensure the
objectivity and credibility of the data.

3.  Because  the  different  assessment  methods  yield  different  results, 
districts  should  exercise  caution  in  relying  on  different  forms  of  assessments  at 
different  grade  levels. 


APPENDIX  A 

SUMMARIES  OF  EMPIRICAL  STUDIES  PRESENTED  IN  CHAPTER  2 

Baxter, Shavelson, Goldman, and Pine (1992) evaluated a procedure-based
scoring system for a science performance assessment and compared
student  performance  on  an  observed  science  investigation  and  a traditional 
multiple  choice  norm-referenced  test  (MC-NRT).  In  the  “paper  towel 
experiment,”  a total  of  96  fifth-grade  students  were  asked  to  determine  which  of 
three  paper  towels  held  the  most  water  and  to  record  in  a notebook  a 
description of the investigation. There were two groups of students: 41 who
were experienced in hands-on science (ES) and 55 who were inexperienced in
hands-on science (IS) and were accustomed to receiving science instruction
via a traditional textbook approach. As a benchmark by which to evaluate the
performance  assessment,  all  students  were  administered  the  science  portion  of 
the  Comprehensive  Test  of  Basic  Skills  (CTBS).  In  addition,  students  were  also 
administered  the  Cognitive  Ability  Test  (CAT)  as  a measure  of  general 
intellectual  ability. 

A scoring  system  in  the  form  of  a flow  chart  was  established  to  evaluate 
the  steps  taken  by  the  students  in  coming  to  a conclusion.  Students  were 
required  to  complete  the  following  steps:  (a)  choose  a method  for  getting  the 
towels  wet  (i.e.,  put  each  towel  into  a container  of  water  or  use  an  eye  dropper 
and drip water on each towel or wipe up water spilled on a tray), (b) decide
whether  or  not  to  completely  saturate  the  towels,  and  (c)  choose  a method  to 
determine  the  results  (i.e.,  weighing  each  towel).  In  order  to  score  all  of  these 
procedural  sequences  on  a common  metric,  the  various  steps  were  rank 
ordered  as  to  their  scientific  soundness  and  corresponding  grades  were 
assigned. 

Students  were  observed  by  one  of  four  pairs  of  trained  observers.  While 
the  student  performed  the  investigation,  each  of  the  observers  independently 
noted  the  procedures  used  and  assigned  a letter  grade.  After  completing  the 
investigation,  the  students  completed  a notebook  describing  the  investigation  in 
such  a way  that,  “a  friend  could  replicate  it.”  After  6 months,  each  notebook  was 
independently  scored  with  the  same  scoring  system  used  in  the  hands-on 
observation.  The  procedures  were  evaluated  and  each  observer  assigned  a 
letter  grade  to  the  notebook. 

In  order  to  investigate  the  reliability  (generalizability)  of  scoring  hands-on 
performance  and  notebooks,  Baxter  et  al.  conducted  a person-by-observer 
generalizability  study  for  each  of  the  four  observer  pairs  and  studied  the  degree 
to  which  observers  of  hands-on  performance  or  raters  of  notebooks  agreed  on 
the  procedures  used  by  the  student  and  the  extent  to  which  the  notebooks  could 
serve  as  a substitute  to  hands-on  performance.  Generalizability  of  scores 
based  on  observed  hands-on  performance  was  .96  for  the  total  sample  and  did 
not  vary  by  students’  experience  level  in  hands-on  science.  Generalizability  of 
scores  based  on  notebooks  was  not  as  high  and  did  vary  with  student  ability,  as 
measured by the Cognitive Ability Test. Generalizability for the total sample on
notebooks was .85, for the ES group .91, and for the IS group .82.

The  authors  went  on  to  explore  the  validity  of  the  procedures  by 
examining  the  performance  of  ES  and  IS  groups  on  multiple-choice  and 
performance  measures,  while  considering  student  ability.  While  ES  students 
performed  better  than  IS  students  on  the  CTBS,  the  hands-on  observation,  and 
the  notebook,  it  would  appear  that  ability  is  confounded  with  experience. 
However,  by  using  analysis  of  covariance  (ANCOVA),  with  the  CAT  as  the 
covariate,  the  authors  were  able  to  report  both  adjusted  and  unadjusted  means 
for  the  two  groups.  Ability  correlated  less  with  the  performance  measures  (both 
observed  and  notebooks)  than  with  the  CTBS  multiple-choice  test,  roughly  .46 
compared  with  .71  in  the  total  sample.  The  authors  stated  that  “for  the  total 
sample,  performance  measures  drew  less  on  traditional  cognitive  abilities  than 
did  multiple-choice  tests,  perhaps  because  of  the  concrete  nature  of  the  former 
and  the  broad,  abstract  nature  of  the  latter”  (p.  11). 

Baxter et al. then went on to investigate the correlations among the
multiple-choice and performance scores for the total group, which fell in the
.40-.47 range, indicating that the different measures tapped different aspects of
science achievement. Because the notebook method involved lower dollar and
time costs than the observation method, they concluded that the
notebook was a viable alternative to observation. In a related article,
Shavelson,  Baxter,  and  Pine  (1992)  suggested  that  because  students  could 
perform  well  on  one  task  and  poorly  on  another,  a sizeable  number  of  tasks 
would  be  needed  in  order  to  achieve  the  desired  reliability. 

Burger and Burger (1994) examined the consistency of results obtained
from  performance  measures  and  norm-referenced  tests.  Specifically,  the 
performance  of  sixth-grade  students  on  the  Michigan  Educational  Assessment 
Program  (MEAP)  Essential  Skills  Reading  Test  and  a locally  developed  writing 
assessment  were  compared  to  performance  on  relevant  subtest  sections  of  the 
Comprehensive  Test  of  Basic  Skills,  4th  Edition  (CTBS-4).  Both  the  MEAP  and 
the  writing  assessment  were  paper-and-pencil  measures  that  fit  the 
characteristics  of  performance  assessment  outlined  by  the  Center  for  Research 
on  Evaluation,  Standards,  and  Student  Testing.  The  emphasis  in  these 
assessments  was  on  both  process  and  product.  The  writing  assessment  had 
the  following  features: 

Sequential  coordination  from  grade  to  grade,  face  validity  in  terms 
of  professional  judgment,  content  validity  in  terms  of  an  empirical 
analysis  of  student  writing,  and  consistency  with  contemporary 
scholarship  in  learning  theory,  psycholinguistics,  and  language 
development. (p. 10)

Each  student  wrote  a first-draft  essay  for  two  out  of  three  prompts.  Together,  the 
student  and  teacher  selected  what  they  believed  to  be  the  best  essay,  which 
was forwarded for scoring by a minimum of two trained raters. If a paper
received  two  different  scores  that  could  not  be  resolved,  a third  rater  was  used. 
The  writing  assessment  was  holistically  scored  using  a 7-point  rating  scale. 

The  Essential  Skills  Reading  Test  (MEAP)  was  designed  to  measure  the 
attainment  of  the  goals  of  the  district’s  reading  curriculum.  The  assessment 
contains excerpts of stories or chapters representative of materials that
students were likely to encounter in their daily reading, along with
accompanying multiple-choice items designed to measure how well they are
able to construct meaning from what they have read. Finally, the CTBS-4 is a
norm-referenced test designed to measure achievement in basic skills that were
taught throughout the nation. Subtests included in the analysis were language
mechanics,  language  expression,  total  language,  reading  vocabulary,  reading 
comprehension,  and  total  reading. 

A multivariable  (writing  and  reading)  by  multimethod  (performance  and 
norm-referenced)  analysis  was  applied  to  the  sixth-grade  assessment  data. 
Intertest  correlations  and  test  reliabilities  for  642  sixth-grade  students  with  data 
on  all  three  assessments  were  estimated.  Test  reliabilities  for  the  CTBS  and  the 
MEAP  were  in  the  .82  to  .95  range.  Interrater  reliability  for  the  writing 
assessment  was  .69. 

In  terms  of  criterion-related  validity,  the  correlations  between  the  writing 
performance  measure  and  the  language  portions  of  the  CTBS  were  in  the  .39  to 
.48  range,  which  were  significant  at  the  .001  level.  For  the  reading  performance 
measure  and  the  reading  portions  of  the  CTBS,  the  correlations  were  in  the  .55 
to  .71  range,  also  significant  at  the  .001  level. 

As  the  authors  pointed  out,  for  a satisfactory  validation  process,  intertest 
correlations  should  also  be  greater  than  the  coefficients  for  different  variables 
measured  by  the  same  method  and  greater  than  those  for  different  variables 
measured  by  different  methods  (p.  12).  That  is,  reading  and  writing 
performance  measures  and  reading  and  language  CTBS  subtests  should  have 
a lower  correlation  than  the  correlation  of  the  reading  performance  measure 
with  the  reading  subtests  of  the  CTBS  and  the  writing  performance  measure 
with the language subtests of the CTBS. This condition was only partially met
in  this  study  in  that  the  validity  coefficients  for  the  writing  assessment  were 
smaller  than  the  correlations  between  reading  and  language  scores  on  the 
CTBS-4  (.58  to  .81).  The  coefficients  for  both  performance  measures  were 
slightly  higher  than  the  correlations  between  different  variables  as  measured  by 
different methods (r = .40 to .62). The authors suggested that one explanation
for this may be the stringent, empirical development of the
CTBS-4. They concluded that performance assessments may provide different
kinds  of  information  about  student  abilities  than  do  normative  assessments. 
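
The comparisons described above can be sketched as a simple range check. This is only an illustration, not the procedure the authors used: the coefficient ranges are those reported in the text, while the function name `exceeds`, the three-way labels, and the representation of each set of coefficients as a (low, high) range are assumptions made for the example.

```python
# Illustrative check of the convergent/discriminant comparisons described above.
# The ranges come from the text; the function and its labels are assumptions.

def exceeds(validity, comparison):
    """Classify how a range of validity coefficients compares with a
    range of comparison correlations."""
    v_low, v_high = validity
    c_low, c_high = comparison
    if v_low > c_high:
        return "met"            # every validity coefficient exceeds every comparison value
    if v_high <= c_low:
        return "not met"        # no validity coefficient exceeds any comparison value
    return "partially met"      # the two ranges overlap

validity = {"writing": (0.39, 0.48), "reading": (0.55, 0.71)}
same_method = (0.58, 0.81)        # reading vs. language scores on the CTBS-4
different_method = (0.40, 0.62)   # different variables, different methods

for measure, coefficients in validity.items():
    print(measure,
          "| vs. same method:", exceeds(coefficients, same_method),
          "| vs. different methods:", exceeds(coefficients, different_method))
```

Under these assumptions, the writing coefficients fall entirely below the same-method correlations, whereas the reading coefficients overlap them, which is consistent with the condition being only partially met.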

Buttram  and  McCann  (1993)  also  investigated  the  relationship  of 
standardized  norm-referenced  achievement  test  scores  to  other  indicators  of 
student  achievement.  Their  study,  however,  occurred  in  the  context  of  Title  I 
evaluation  and  the  prescribed  method  at  the  time,  which  involved  the  NCE  gain. 
Specifically, they examined the relationship of NCE gains in the areas of
reading  and  mathematics  to  the  percentage  of  students  with  higher  posttest 
scores  than  pretest  scores  in  reading  and  mathematics,  reading  and 
mathematics  teacher-assigned  report  card  marks,  instructional  reading  level  as 
measured  by  basal  level,  and  an  exit  indicator  as  to  whether  or  not  the  student 
would  remain  eligible  for  Title  I services  during  the  following  school  year,  which 
was  based  on  absolute  performance  on  the  NRT  (i.e.,  40th  percentile  for  grades 
K-4  and  25th  percentile  for  grades  5 and  above).  School-level  data  for  the  four 
elementary  schools  gathered  on  the  eight  different  indicators  over  5 school 
years  were  analyzed.  In  the  first  set  of  analyses,  gains  (or  losses)  on  each  of 
the eight indicators for the four schools over the 5 years were calculated (e.g.,
mean NCE gain in reading and mathematics was calculated for each school
from  one  year  to  the  next  and  change  in  the  percentage  of  students  receiving 
As,  Bs,  and  Cs  on  their  report  cards  in  reading  and  mathematics). 

In  the  second  set  of  analyses,  aggregated  NCE  gains  (or  losses) 
exceeding  three  and  percentage  gains  (or  losses)  exceeding  five  were  coded 
as  significant.  In  the  third  set  of  analyses,  NCE  gains  on  the  NRT  measure  and 
gains  obtained  on  each  of  the  other  reading  or  mathematics  indicators  were 
compared  to  determine  if  the  different  measures  produced  similar  results. 
Because  NRT  reading  and  mathematics  NCE  gains  were  the  most  commonly 
used  indicators  for  measuring  progress  in  Title  I programs  at  the  time  of  the 
study,  they  were  chosen  as  the  standard  for  comparison.  They  found  that  none 
of  the  reading  or  mathematics  indicators  consistently  produced  findings  similar 
to  the  NRT  NCE  gains.  Great  variation  on  the  indicators  was  found  from  school 
to school and from year to year (e.g., changes in the percentage of Title I
students receiving As, Bs, or Cs in reading or mathematics both produced the
greatest number of matches to NRT NCE gain at 41.7%, while the exit indicator
(i.e., proportion eligible for Title I services) produced the fewest
matches to the NRT NCE gain at 31.3%). In addition, the authors found that
schools  frequently  demonstrated  significant  increases  on  one  of  the  indicators 
while  producing  significant  decreases  on  the  other.  From  these  results,  the 
researchers  concluded  that  much  more  discussion  and  solid  evidence  about 
valid and reliable approaches to the evaluation of Title I program effectiveness
is needed.
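
The coding and matching steps described above can be sketched as follows. This is a hypothetical illustration: the thresholds (three NCEs, five percentage points) come from the text, but the gain values are invented, and treating a "match" as two indicators receiving the same code for the same school-year comparison is an assumption, since the text does not spell out how matches were defined.

```python
# Hypothetical sketch of the coding and matching steps; all gain values are invented.

def code_change(change, threshold):
    """Return +1 for a significant gain, -1 for a significant loss, 0 otherwise."""
    if change > threshold:
        return 1
    if change < -threshold:
        return -1
    return 0

def match_rate(nce_gains, pct_changes, nce_threshold=3.0, pct_threshold=5.0):
    """Percentage of school-year comparisons in which the NCE gain and the
    other indicator receive the same code (assumed definition of a match)."""
    matches = sum(
        code_change(n, nce_threshold) == code_change(p, pct_threshold)
        for n, p in zip(nce_gains, pct_changes)
    )
    return 100.0 * matches / len(nce_gains)

nce_gains = [4.2, -1.0, -3.5, 2.9]     # invented mean NCE gains
pct_changes = [6.0, 7.5, -8.0, -2.0]   # invented changes in percentage of students
print(match_rate(nce_gains, pct_changes))  # 75.0
```

In this invented example the two indicators agree in three of the four comparisons; the second comparison shows a significant gain on one indicator and no significant change on the other.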

Gentile (1992) reported on a 1990 NAEP study which compared grades
4 and  8 student  performance  on  assigned  tasks  in  the  NAEP  direct  writing 
assessment  with  student  performance  measured  by  a sample  of  school-based 
writing.  Approximately  36%  of  the  time,  students  received  significantly  different 
scores  on  their  NAEP  writing  assessment  than  those  they  received  on  their 
school-based  writing,  with  discrepancies  more  evident  at  grade  8.  About 
two-thirds  of  these  students  performed  better  on  the  NAEP  assessment.  Gentile 
hypothesized  that  the  reason  for  this  differential  performance  could  be  that  “the 
different  procedures  and  features  of  the  methods  of  assessment  may  result  in  a 
sampling  of  different  aspects  of  students’  writing  abilities”  (p.  66). 

Koretz  et  al.  (1993)  investigated  performance  of  grades  4 and  8 students 
on the state portfolio assessment in math and writing. They found that total
scores within each subject area were scored with much higher reliability
than scores based on individual pieces found in the portfolio.
Furthermore, scoring reliability in the math area was found to be considerably
higher than in the writing area.

Shepard  et  al.  (1996)  investigated  the  impact  of  the  use  of  performance 
assessments on instruction and student learning. Using pre- and posttest
reading and mathematics data from the Comprehensive Test of Basic Skills (CTBS)
as a covariate and selected items from the Maryland State Performance
Assessment and an alternative mathematics assessment as the outcome
measures, they compared the performance of 524 third-grade students with that
of a cohort group from the previous year and a control group of students from
matched schools. The teachers in the focus group participated in year-long
training in the development of classroom performance assessments that would
be  congruent  with  their  instructional  goals.  Throughout  the  school  year, 
teachers  used  performance  assessments  such  as  running  records  and  written 
summaries  to  assess  reading  fluency  and  comprehension,  respectively.  In  the 
area  of  math,  teachers  increased  their  use  of  nonroutine  problems  which 
required  students  to  explain  their  solutions  or  to  analyze  and  explain  an 
incorrect  step  in  a problem,  as  well  as  the  use  of  math  manipulatives.  Interrater 
reliability  for  the  abbreviated  Maryland  State  Performance  Assessment  was 
estimated  both  within  year,  using  a slightly  more  than  10%  random  sample,  and 
between  years.  For  the  within-year  analysis,  Pearson  correlations  between 
raters  ranged  from  .96  to  .99.  To  check  consistency  of  raters  across  years,  57 
test booklets were “seeded” into the next year’s set, yielding a percentage of
agreement (within .34 standard deviations) of 72% in reading and 79% in math.
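
A percentage of agreement within a tolerance expressed in standard deviation units can be computed along the following lines. This is a minimal sketch under stated assumptions: the scores are invented, and taking the standard deviation from the first set of scores (rather than, say, a pooled or published value) is an assumption, because the text does not say which standard deviation was used.

```python
from statistics import pstdev

def agreement_within(scores_a, scores_b, tolerance_sd=0.34):
    """Percentage of paired scores that differ by no more than tolerance_sd
    standard deviations. The SD is taken from scores_a (an assumption)."""
    limit = tolerance_sd * pstdev(scores_a)
    agree = sum(abs(a - b) <= limit for a, b in zip(scores_a, scores_b))
    return 100.0 * agree / len(scores_a)

# Invented scores for the same booklets rated in two successive years.
year1 = [10, 12, 14, 16, 18]
year2 = [10, 13, 14, 19, 18]
print(agreement_within(year1, year2))  # 60.0
```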

In  terms  of  the  impact  on  student  achievement,  while  no  statistically  significant 
differences  were  found  between  participating  and  control  schools  on  either  the 
Maryland  Performance  Assessment  or  the  alternative  math  assessment, 
Shepard  et  al.  found  qualitative  differences  in  achievement  levels  in  that 
students  at  participating  schools  were  able  to  perform  more  extended  problems 
and justify their answers. The authors concluded that the significant
contribution of performance assessments to student achievement lies not merely
in their use but in their value as a tool for exposing students to relevant
problems and providing them with opportunities for learning.

Stevens and Clauser (1995) examined concurrent and discriminant
validity of six Iowa Tests of Basic Skills (ITBS) subtests (vocabulary, reading,
spelling, capitalization, punctuation, and language usage and expression) at
grade 3 and a state writing performance assessment for the same 1,000
students  a year  later  in  grade  4.  The  ITBS  is  composed  of  multiple-choice  or 
other  limited-response  item  formats.  The  writing  performance  assessment 
required  students  to  respond  to  a given  prompt  and  was  scored  holistically  by 
external  raters  using  rubrics  developed  based  upon  benchmarks  provided  by 
classroom  teachers.  Using  confirmatory  factor  analysis,  the  researchers 
evaluated  a series  of  models  that  represented  varying  hypotheses  regarding 
the  interrelationships  among  the  measured  variables.  They  concluded  that  the 
majority  of  the  variance  among  the  subtests  could  be  accounted  for  by  the 
effects  of  the  two  method  factors  and  that  “the  method  of  measurement  had  a 
great  deal  to  do  with  the  results  of  the  assessment,  regardless  of  what  was 
being  assessed”  (p.  10). 

Ercikan  and  Schwarz  (1995)  also  used  confirmatory  factor  analysis  to 
examine  how  the  dimensionality  of  item  types  (multiple-choice  and  constructed 
response)  varied  over  three  subject  areas  (reading,  math,  and  science)  for  third- 
grade  students.  The  assessments  included  the  Comprehensive  Test  of  Basic 
Skills  (CTBS)  subtests  of  Reading  Comprehension,  Math  Concepts  and 
Applications,  and  Science  and  a constructed-response  test  consisting  of  tasks 
designed  around  themes  in  the  areas  of  reading,  math,  and  science.  Testlets 
formed  on  the  basis  of  themes  were  used  to  reduce  the  data  matrix.  The 
researchers tested a one-factor model that allowed testlets of both item types
to load onto a single factor. The one-factor model was compared to an
alternative two-factor model in which each item type (i.e., multiple-choice and
constructed  response)  formed  a separate  factor.  Using  goodness  of  fit  indices 
such  as  chi-square/degrees  of  freedom  ratio,  the  nonnormed  fit  index  (NNFI), 
Akaike’s information criterion (AIC), and the hierarchical chi-square test, the
researchers  concluded  that  the  two-factor  model  was  preferable  to  the 
one-factor  model  for  all  three  content  areas.  This  would  suggest  that  the 
different  test  types  are  measuring  different  constructs  in  all  three  content  areas. 
However, given that the two factors were highly correlated (.69 to .82),
“the  two  item  types  were  closely  related,  even  though  they  may  not  be  identical” 
(p.  21).  A portion  of  the  analysis  focused  on  the  relationship  between  the  two 
item  types  for  low  and  high  ability  students.  Using  similar  methods,  the 
researchers  found  that  the  factor  structures  for  the  two  item  types  were  not 
similar  for  the  two  ability  groups.  One  possible  explanation  for  this  was  that  the 
low-ability  students  were  unable  to  respond  to  the  task  demands  for  the 
constructed  response  items.  These  findings  are  particularly  relevant  as  they 
pertain to issues of equity in the assessment of students and the evaluation of
programs.

Grade 3 Introductory Paper

[title illegible]

Iguanas are very interesting animals. They can live to
be 60 years or maybe older. They are in the reptile
family. They shed about [illegible] a month. They can run
very fast on the land, and it is hard to catch them if they
have escaped. They are very nice to have as pets, but it
is the new [illegible]. There are over forty species of
iguanas. They like to swim in rivers lakes and ponds.
When a iguana is attacked they swat their tail to defend
themselves [illegible]. They do not change color like
some other lizards do, because they blend in with the
environment so good they don’t have to. Iguanas are a
very interesting animal to study, and to enjoy!


Grade  5 Advanced  Paper 